Abstract: Consider the Gaussian sequence model $y \sim N(\theta^*, \sigma^2 I_n)$, where
$\theta^*$ is unknown but known to belong to a closed convex polyhedral set $C \subset \mathbb{R}^n$.
In this paper we provide a unified characterization of the degrees of freedom
for estimators of $\theta^*$ obtained as the (linearly or quadratically perturbed)
partial projection of $y$ onto $C$. As special cases of our results, we derive explicit
expressions for the degrees of freedom in many shape restricted regression
problems, e.g., bounded isotonic regression, multivariate convex regression
and penalized convex regression. Our general theory also yields, as special
cases, known results on the degrees of freedom of many well-studied estimators
in the statistics literature, such as ridge regression, Lasso and generalized
Lasso. Our results can be readily used to choose the tuning parameter(s)
involved in the estimation procedure by minimizing Stein's unbiased risk
estimate. We illustrate this through simulation studies for bounded isotonic
regression and penalized convex regression. As a by-product of our analysis
we derive a connection between bounded isotonic regression and
isotonic regression on a general partially ordered set, which is of independent
interest.
1. Introduction


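Consider the Gaussian sequence model

$$ y = \theta^* + \varepsilon, \tag{1} $$

where we observe $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$, $\theta^* = (\theta_1^*, \ldots, \theta_n^*) \in \mathbb{R}^n$ is the unknown
parameter of interest and $\varepsilon \sim N(0, \sigma^2 I_n)$ ($I_n$ is the $n \times n$ identity matrix) is
the unobserved error. We assume that $\theta^*$ is known to belong to a given closed
convex set $C \subset \mathbb{R}^n$. Let $\hat{\theta}(y) := (\hat{\theta}_1, \ldots, \hat{\theta}_n)$ be an estimator of $\theta^*$. The "degrees
of freedom" of $\hat{\theta}(y)$ (see Efron (2004)) is defined as

$$ \mathrm{df}(\hat{\theta}(y)) := \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{\theta}_i, y_i). \tag{2} $$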


Degrees of freedom (DF) is an important concept in statistical modeling and
is often used to quantify the model complexity of a statistical procedure; see
e.g., Meyer and Woodroofe (2000), Zou, Hastie and Tibshirani (2007), Tibshirani
and Taylor (2012), and the references therein. Intuitively, the quantity $\mathrm{df}(\hat{\theta}(y))$
reflects the effective number of parameters used by $\hat{\theta}(y)$ in producing the fitted
output. For example, in linear regression, if $\hat{\theta}(y)$ is the least squares estimator (LSE)
obtained by projecting $y$ onto a subspace of dimension $d < n$, the DF of $\hat{\theta}(y)$ is simply $d$. Using
Stein's lemma it follows that (see Meyer and Woodroofe (2000) and Tibshirani
and Taylor (2012))

$$ \mathrm{df}(\hat{\theta}(y)) = E[D(y)], $$

where

$$ D(y) = \mathrm{div}(\hat{\theta}(y)) := \sum_{i=1}^{n} \frac{\partial \hat{\theta}_i(y)}{\partial y_i} \tag{3} $$

is called the divergence of $\hat{\theta}(y)$. Thus, $D(y)$ is an unbiased estimator of $\mathrm{df}(\hat{\theta}(y))$.
This has many important implications, e.g., Stein's unbiased risk estimate (SURE);
see Stein (1981). Aside from plainly estimating the risk of an estimator, one could
also use SURE for model selection purposes: if the estimator depends on a tuning
parameter, then one could choose this parameter by minimizing SURE. This
has been successfully used in many applications; see e.g., Donoho and Johnstone
(1995) for an application in wavelet denoising, Mukherjee et al. (2015) for an
example in reduced rank regression, and Candes, Sing-Long and Trzasko (2013) for
an application in singular value thresholding. We elaborate on this connection in
Section 6.
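To make the role of the divergence in model selection concrete, the following minimal sketch evaluates the standard form of SURE, $\|y - \hat{\theta}(y)\|_2^2 - n\sigma^2 + 2\sigma^2 D(y)$, and picks the candidate fit with the smallest value. The function names, the dictionary of candidate fits and the toy data are hypothetical and are not taken from the paper; $\sigma^2$ is assumed known.

```python
def sure(y, theta_hat, divergence, sigma2):
    """Stein's unbiased estimate of the risk E||theta_hat - theta*||^2.

    Uses the standard identity RSS - n*sigma2 + 2*sigma2*D(y).
    """
    n = len(y)
    rss = sum((yi - ti) ** 2 for yi, ti in zip(y, theta_hat))
    return rss - n * sigma2 + 2.0 * sigma2 * divergence


def select_by_sure(y, candidates, sigma2):
    """Return the key of the candidate fit with the smallest SURE.

    `candidates` maps a tuning-parameter label to a pair
    (theta_hat, divergence) that was computed elsewhere.
    """
    return min(
        candidates,
        key=lambda k: sure(y, candidates[k][0], candidates[k][1], sigma2),
    )


y = [1.0, 2.0, 3.0]
# Two hypothetical fits: the grand mean (divergence 1) and y itself (divergence n).
candidates = {
    "mean": ([2.0, 2.0, 2.0], 1),
    "identity": (list(y), 3),
}
best = select_by_sure(y, candidates, sigma2=1.0)
```

In this toy case the mean fit has SURE 2 - 3 + 2 = 1 and the identity fit has SURE 0 - 3 + 6 = 3, so the mean fit is selected.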


In this paper we develop a general theoretical framework to evaluate the
divergence in (3) for a broad class of regression problems, with special emphasis
on shape restricted regression. Our general theory also recovers many existing
results, which include the exact expressions of the divergence for ridge regression
(see Li (1986)), Lasso and generalized Lasso (see Zou, Hastie and Tibshirani
(2007) and Tibshirani and Taylor (2012)).

Chen, X., Lin, Q. and Sen, B./On Degrees of Freedom of Projection Estimators

Our study of DF in this generality is motivated by problems
in shape constrained regression. In shape restricted regression the observations
$\{(x_i, y_i)\}_{i=1}^{n}$ satisfy

$$ y_i = f(x_i) + \varepsilon_i, \quad \text{for } i = 1, \ldots, n, \tag{4} $$

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$ errors, $x_1, \ldots, x_n$ are design points in $\mathbb{R}^d$
($d \ge 1$) and the regression function $f$ is unknown but obeys certain known restrictions
like monotonicity, convexity, etc. Letting $\theta^* = (f(x_1), \ldots, f(x_n))$, equation
(4) can be rewritten as in (1), where the known shape restriction on $f$ translates
to linear constraints on $\theta^*$, whereby $\theta^* \in C$ for some suitable closed convex set $C$.

We briefly introduce below the two main examples that we study in detail in this
paper, namely isotonic and convex regression.

Example 1 (Bounded isotonic regression) If $f$ is assumed to be non-decreasing
and the $x_i$'s are univariate and ordered (i.e., $x_1 < x_2 < \cdots < x_n$), then $\theta^* \in \mathcal{M}$,
where

$$ \mathcal{M} := \{\theta \in \mathbb{R}^n : \theta_1 \le \theta_2 \le \cdots \le \theta_n\}. \tag{5} $$

Isotonic regression has a long history in statistics; see e.g., Brunk (1955), Ayer
et al. (1955), and van Eeden (1958). Isotonic regression can be easily extended to
the setup where the predictors take values in any space with a partial order; see
Section 3 for the details. In fact, for multivariate predictors, to avoid over-fitting,
a more useful formulation would be to consider bounded isotonic regression: $f$ is
assumed to be non-decreasing and the range of $f$ is bounded by $\lambda$, for $\lambda > 0$.

In Section 3, we show that for bounded isotonic regression $\theta^* \in C$ where $C$ is a
closed polyhedral set (i.e., an intersection of finitely many halfspaces) that can
be expressed as

$$ C := \{\theta \in \mathbb{R}^n : A\theta \le b\} \tag{6} $$

for some suitable matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$, where $c := [c_1, \ldots, c_m]^\top \le
b := [b_1, \ldots, b_m]^\top$ means that $c_i \le b_i$, for all $i = 1, \ldots, m$.

Example 2 (Convex regression) In convex regression (see e.g., Hildreth (1954),
Kuosmanen (2008), Seijo and Sen (2011), Lim and Glynn (2012), Xu, Chen and
Lafferty (2014)) $f : \mathbb{R}^d \to \mathbb{R}$ is known to be a convex function (see (4)) and
$x_1, \ldots, x_n$ is the set of design points in $\mathbb{R}^d$, $d \ge 1$. Letting $\theta^* = (f(x_1), \ldots, f(x_n))$,
it can be shown that the convexity of $f$ is equivalent to $\theta^*$ belonging to a convex
polyhedral set $C$. When $d = 1$ and the $x_i$'s are ordered, $C$ has a simple characterization:

$$ C = \left\{ \theta \in \mathbb{R}^n : \frac{\theta_2 - \theta_1}{x_2 - x_1} \le \cdots \le \frac{\theta_n - \theta_{n-1}}{x_n - x_{n-1}} \right\}. \tag{7} $$

For $d \ge 2$, the characterization of the underlying convex set $C$ is more complex. In
fact, $C$ can be expressed as the projection of the higher-dimensional polyhedron

$$ Q := \{(\xi, \theta) : A\xi + B\theta \le c\} \tag{8} $$

onto the space of $\theta$, where $\xi := [\xi_1^\top, \ldots, \xi_n^\top]^\top$ is the auxiliary vector representing
the subgradients of $f$ at $x_j$, for $j = 1, \ldots, n$, and $A$, $B$ and $c$ are suitable matrices;
see Section 4 for the details.

Let us come back to the general problem described in (1). In the following we
briefly describe our main results and contributions.

• Given a convex polyhedron $C$ of the form (6), a natural estimator $\hat{\theta}(y)$ of
$\theta^* \in C$ is the projection of $y$ onto $C$, i.e.,

$$ \hat{\theta}(y) = P_C(y) := \arg\min_{\theta \in C} \|y - \theta\|_2, \tag{9} $$

where $\|\cdot\|_2$ denotes the Euclidean norm. In Section 2, after developing the
necessary background, we briefly review the results on the characterization
of divergence when $C$ is a convex polyhedron from Tibshirani and Taylor
(2012). We utilize these results to study the divergence of the projection
estimator $\hat{\theta}(y)$ in univariate convex regression. Further, in Section 3, we use
these results to study another important class of shape-restricted regression
problems, namely, bounded isotonic regression (see Example 1), where the
design points are allowed to belong to any partially ordered set. We show
that in this problem the divergence $D(y)$ is equal to the number of connected
components of the graph induced by the partially ordered set and
the estimator $\hat{\theta}(y)$.

We also establish a connection between the solutions of bounded
isotonic regression and (unbounded) isotonic regression on a general partially
ordered set. In particular, we show that the LSE for bounded isotonic
regression can be easily obtained by appropriately thresholding the
unbounded isotonic LSE (see Proposition 3.3). This result is also of independent
interest. Further, using this property, we show that, for bounded isotonic
regression, the divergence (and DF) is monotone as a function of the model
complexity parameter, which shows that DF indeed captures the model
complexity.

• In Section 4, we study the class of regression problems that can be formulated
as the projection of $y$ onto a polyhedron $C$ (not easily expressible as
in (6)) which is defined as the projection of a higher dimensional polyhedron
$Q$, i.e.,

$$ C := \mathrm{Proj}_\theta(Q) = \{\theta \in \mathbb{R}^n : \exists\, \xi \in \mathbb{R}^p \text{ such that } (\xi, \theta) \in Q\}, $$

where $Q$ is in the form of (8). This class of problems includes multivariate
convex regression, for which DF has not been studied before. In fact,
classical linear regression can also be easily expressed in this form. Since
$C = \mathrm{Proj}_\theta(Q)$ cannot be explicitly represented as a system of inequalities
as in (6), the analysis of the divergence is more challenging. To characterize
the divergence of $\hat{\theta}(y)$, we develop a lifting approach by formulating the
projection of $y$ onto $C$ as a partial projection of $(a, y)$ (for an arbitrary
$a \in \mathbb{R}^p$) onto $Q$, and show that, for almost every $y \in \mathbb{R}^n$, $\hat{\theta}(y)$ is locally
equivalent to the partial projection of $(a, y)$ onto the minimal face of $Q$
containing $\hat{\theta}(y)$ and the associated auxiliary vector $\hat{\xi}$. The divergence of $\hat{\theta}(y)$ can then be
characterized using the KKT conditions corresponding to the optimization
problem of the partial projection onto this minimal face.

This lifting formulation is further generalized to characterize the divergence
of a broader class of regression problems that can be viewed as a linearly
perturbed partial projection problem, namely, an optimization problem over
$Q$ whose objective function contains the Euclidean distance to $y$ plus a
linear function of the auxiliary variables, i.e.,

$$ \min_{(\xi, \theta) \in Q} \ \frac{1}{2}\|\theta - y\|_2^2 + d^\top \xi, $$

for some given vector $d$. We show that such a partial projection formulation
includes many important problems in statistics, such as Lasso and generalized
Lasso. Note that although the divergences and DF for Lasso and
generalized Lasso have been characterized in Zou, Hastie and Tibshirani
(2007) and Tibshirani and Taylor (2012), we demonstrate that we recover
their results as straightforward consequences of a more general theory (see
Theorem 4.6 and Section 4.2.3 for details).

• In Section 5, we generalize our framework to the class of regression problems
that can be viewed as a quadratically perturbed projection problem, where
the objective function contains the Euclidean distance to $y$ plus a quadratic
function of the auxiliary variables, i.e.,

$$ \min_{(\xi, \theta) \in Q} \ \frac{1}{2}\|\theta - y\|_2^2 + (\text{quadratic function of } \xi). \tag{10} $$

A simple example of such a formulation is ridge regression, whose DF has
been studied in Li (1986). In addition to recovering the result in Li (1986) as
a special case of the general theorem, we further provide a new result on the
divergence of penalized multivariate convex regression where we penalize
the norm of the subgradient $\xi$. Due to the presence of the quadratic term
in $\xi$, the divergence of $\hat{\theta}(y)$ can no longer be given as the dimension of
the minimal face of $Q$ containing $\hat{\theta}(y)$ and the associated $\hat{\xi}$. To address
this challenge, we utilize a classical result in analysis, the implicit function
theorem, and apply it to the KKT system of equations of (10) to compute
the value of the divergence. Our proof technique based on the implicit function
theorem is quite general and can potentially be applied to more complicated
shape restricted problems.

• Finally, in Section 6 we discuss how our characterization of DF can help in
model selection based on SURE. We further conduct empirical studies to
demonstrate the performance of the estimator chosen by minimizing SURE
for bounded isotonic regression and penalized multivariate convex regression.
Indeed, we see substantial gains in the performance of the estimators
tuned using SURE.
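As a small illustration of the ridge regression case mentioned in the list above, the sketch below evaluates the standard closed form for the DF of a ridge fit, the trace of $X(X^\top X + \lambda I)^{-1}X^\top$, from the singular values of $X$. The parametrization of the penalty as $\lambda\|\beta\|_2^2$ added to $\|y - X\beta\|_2^2$, and the name `ridge_df`, are assumptions made for this example and need not match the formulation used in Section 5.

```python
def ridge_df(singular_values, lam):
    """Trace of X (X^T X + lam I)^{-1} X^T, given the singular values of X.

    Each singular value d contributes d^2 / (d^2 + lam); lam must be
    positive, or every singular value must be nonzero.
    """
    return sum(d * d / (d * d + lam) for d in singular_values)


# With lam = 0 and a full-column-rank design, the DF equals the number of
# columns; increasing lam shrinks every term towards zero.
df_unpenalized = ridge_df([2.0, 1.0], 0.0)
df_penalized = ridge_df([1.0, 1.0], 1.0)
```

Because the ridge fit is linear in $y$, its divergence does not depend on $y$, so the divergence and the DF coincide.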

In the following we compare and contrast our results with some of the recent
work on divergence and DF of projection estimators. Kato (2009) characterizes the
divergence of the projection estimator onto a convex set $C$ under a smoothness
assumption on the boundary of $C$. However, it can be difficult to apply the results
in Kato (2009) to numerically compute the divergence for many convex sets $C$.
For example, when $C$ is a convex polyhedron, the method by Kato (2009) requires
knowing a set of basis vectors for the face of $C$ containing $\hat{\theta}(y)$, which may not
be easily obtained (e.g., when $C = \mathrm{Proj}_\theta(Q)$ in (26)). In contrast, our method is
computationally simple as it only uses the inequalities defining $Q$ directly (see
e.g., Theorem 4.6). Hansen and Sokol (2014) consider the closed constraint set
$C = \zeta(B)$ where $B \subset \mathbb{R}^p$ is a closed set and $\zeta : \mathbb{R}^p \to \mathbb{R}^n$ is a (possibly
non-linear) map satisfying some regularity conditions. Their main result (Theorem 3)
requires the optimal solution $\hat{\beta}$ to be in the interior of $B$ (which is almost never
the case in the examples of interest to us) and a variant of the Hessian matrix of
$\zeta(\beta)$ to be full rank (e.g., when $\zeta(\beta) = X\beta$, it requires that $X$ is full rank).
The results in both Kato (2009) and Hansen and Sokol (2014) can only
deal with a constraint set that can be explicitly written as a set of inequalities
(e.g., the general projected polyhedron $\mathrm{Proj}_\theta(Q)$ in (26) is not allowed) and
cannot be applied to regularized estimators (e.g., generalized Lasso in Section 4.2.3
and penalized multivariate convex regression in Section 5). Vaiter et al. (2014)
study DF for a class of regularized regression problems which include Lasso and
group Lasso as special cases. However, that paper does not consider constrained
formulations and thus cannot be applied to shape restricted regression problems.
Rueda (2013) utilizes the results of Meyer and Woodroofe (2000) to study the
DF for the specific problem of semiparametric additive (univariate) monotone
regression.

In the recent papers Kaufman and Rosset (2014) and Janson, Fithian and
Hastie (2015) the authors argue that in many problems DF might not be an
appropriate notion for model complexity. They provide counterexamples of
situations where DF is not monotone in the model complexity parameter (or DF
is unbounded). However, most of these counterexamples involve either nonconvex
constraints or non-Gaussian or heteroscedastic noise; in Janson, Fithian
and Hastie (2015) it is argued that such irregular behavior happens "whenever
we project onto a nonconvex model". In contrast, the main applications in our
paper, namely, bounded isotonic regression and penalized convex regression,
correspond to projections onto polyhedral convex sets with i.i.d. Gaussian noise, so
the irregular behavior of DF observed in some of the counterexamples is not
expected to occur here. In fact, in Theorem 3.4 we prove that for bounded isotonic
regression, DF is indeed monotone in the model complexity parameter.

The paper is arranged as follows. We start with a brief review of some useful
concepts from convex analysis in Section 2.1. A brief description of some problems
in shape restricted regression and some basic results on the divergence of
projection estimators is given in Section 2.2. Sections 3, 4 and 5 develop our main results
in stages, as described above. In Section 6 we discuss the use of the divergence of
estimators, computed in the paper, to find appropriate tuning parameters in two
examples, namely, bounded isotonic regression and penalized multivariate convex
regression. We relegate all the technical proofs to the appendix.

2. Background 

In this section, we provide the necessary background on convex analysis and 
present some existing results on DF for the projection estimator onto a polyhedral 
cone. 

2.1. Polyhedral Cone, Polyhedron and Projections 

We start with some definitions and notation. We denote by $\langle \cdot, \cdot \rangle$ the usual inner
product in the Euclidean space. Recall that a set $C \subset \mathbb{R}^n$ is a convex polyhedron if
it can be represented as in (6) for some known matrix $A := [a_1, \ldots, a_m]^\top \in \mathbb{R}^{m \times n}$
and a vector $b := [b_1, \ldots, b_m]^\top \in \mathbb{R}^m$. When $b = 0$, the set $C$ is called a
polyhedral cone, which is the intersection of finitely many halfspaces that contain
the origin and can be represented as

$$ C = \{\theta \in \mathbb{R}^n : A\theta \le 0\}. \tag{11} $$

A finite collection of vectors $\theta_1, \theta_2, \ldots, \theta_k \in \mathbb{R}^n$ is affinely independent if the
unique solution to the equality system $\sum_{i=1}^{k} \alpha_i \theta_i = 0$, $\sum_{i=1}^{k} \alpha_i = 0$ is
$\alpha_i = 0$, for $i = 1, 2, \ldots, k$. The dimension $\dim(C)$ of $C$ is the maximum number of
affinely independent points in $C$ minus one. We say that $C$ has full dimension if
$\dim(C) = n$. The affine hull of $C$, denoted by $\mathrm{aff}(C)$, is the affine space consisting
of all affine combinations of elements of $C$, i.e.,

$$ \mathrm{aff}(C) := \left\{ \sum_{i=1}^{k} \alpha_i \theta_i : k > 0,\ \theta_i \in C,\ \alpha_i \in \mathbb{R},\ \sum_{i=1}^{k} \alpha_i = 1 \right\}. $$
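Note that $C$ has full dimension if and only if $\mathrm{aff}(C) = \mathbb{R}^n$. Given a convex
polyhedron $C$, the interior of $C$, denoted by $\mathrm{int}(C)$, is defined as

$$ \mathrm{int}(C) := \{\theta \in C : \exists\, \epsilon > 0 \text{ such that } B_\epsilon(\theta) \subseteq C\}, $$

where $B_\epsilon(\theta) = \{x \in \mathbb{R}^n : \|x - \theta\|_2 \le \epsilon\}$ is the Euclidean ball of radius $\epsilon$ centered
at $\theta$. The boundary $\mathrm{bd}(C)$ of $C$ is defined as

$$ \mathrm{bd}(C) := \{\theta \in \mathbb{R}^n : \forall\, \epsilon > 0,\ C \cap B_\epsilon(\theta) \ne \emptyset \text{ and } (\mathbb{R}^n \setminus C) \cap B_\epsilon(\theta) \ne \emptyset\}. $$

The relative interior $\mathrm{relint}(C)$ of $C$ is defined as its interior within $\mathrm{aff}(C)$, i.e.,

$$ \mathrm{relint}(C) := \{\theta \in C : \exists\, \epsilon > 0 \text{ such that } B_\epsilon(\theta) \cap \mathrm{aff}(C) \subseteq C\}. $$

Similarly, the relative boundary $\mathrm{relbd}(C)$ of $C$ is defined as its boundary within
$\mathrm{aff}(C)$, i.e.,

$$ \mathrm{relbd}(C) := \{\theta \in \mathrm{aff}(C) : \forall\, \epsilon > 0,\ C \cap B_\epsilon(\theta) \ne \emptyset \text{ and } (\mathrm{aff}(C) \setminus C) \cap B_\epsilon(\theta) \ne \emptyset\}. $$

For a given convex polyhedron $C$ in the form of (6), a nonempty subset $F \subseteq C$
is called a face of $C$ if there exists $J \subseteq \{1, 2, \ldots, m\}$ so that

$$ F = \{\theta \in \mathbb{R}^n : \langle a_i, \theta \rangle = b_i,\ \forall\, i \in J \text{ and } \langle a_i, \theta \rangle \le b_i,\ \forall\, i \in J^c\}, \tag{12} $$

where $J^c$ is the complement set of $J$. Note that the same face can be defined
by different $J$'s. A point $\theta \in C$ can belong to more than one face. The smallest
face of $C$ containing $\theta$, in the sense of set inclusion, is called the minimal face
containing $\theta$. The following lemma, proved in the appendix, characterizes the
affine hull of a face of a polyhedron.

Lemma 2.1. For any face $F$ of $C$, $\mathrm{aff}(F) = \{\theta \in \mathbb{R}^n : \langle a_i, \theta \rangle = b_i,\ \forall\, i \in J_F\}$,
which is an affine space.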

The normal cone associated with a face $F$ is defined as

$$ N(F) := \left\{ h \in \mathbb{R}^n : F \subseteq \arg\max_{\theta \in C} h^\top \theta \right\}. \tag{13} $$

From a geometric perspective, the normal cone of $F$ is the set of directions in
$\mathbb{R}^n$ that are perpendicular to $F$ and point outward from $C$ (see Figure 1 for an
illustration). In this paper, we will often deal with the polyhedron $F + N(F) =
\{\theta + h : \theta \in F,\ h \in N(F)\}$, which consists of all points in $\mathbb{R}^n$ that can be reached
by moving a point in $F$ along a direction in $N(F)$; see Figure 1. As a consequence,
the projection of a point in $F + N(F)$ onto $C$ will lie on the face $F$ of $C$, which
is stated as the following lemma.

Lemma 2.2. Let $F$ be a face of $C$. For any $z \in F + N(F)$, $P_C(z) \in F$, where
$P_C(z)$ is defined in (9).
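A quick numerical check of Lemma 2.2 is possible in a case where the projection has a closed form. The sketch below takes $C$ to be the nonnegative orthant in $\mathbb{R}^2$ (that is, $A = -I$ and $b = 0$ in (6)), for which $P_C$ simply clips negative coordinates to zero. The choice of polyhedron, face and test point is ours and serves only as an illustration.

```python
def project_orthant(z):
    """Euclidean projection onto C = {theta : theta_i >= 0 for all i}."""
    return [max(zi, 0.0) for zi in z]


# Face F = {theta in C : theta_1 = 0}. Its normal cone is
# N(F) = {h : h_1 <= 0, h_2 = 0}: such h are maximised over C exactly
# when theta_1 = 0 (or everywhere, if h = 0), so F lies in the argmax set.
theta = [0.0, 2.0]   # a point of F
h = [-1.5, 0.0]      # a direction in N(F)
z = [t + hi for t, hi in zip(theta, h)]   # a point of F + N(F)
p = project_orthant(z)
```

Here `p` equals `[0.0, 2.0]`, which has first coordinate zero and nonnegative second coordinate, so the projection lies on $F$ as the lemma asserts.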
Fig 1: Illustration of the normal cones of a polyhedron. The four vertices of the
polyhedron C are denoted by A, B, C and D, respectively. We denote each face
of C by its vertices, e.g., $F_{AD}$ denotes the line segment connecting A and D
(one-dimensional face) while $F_A$ denotes the vertex A (zero-dimensional face).
The normal cones of all one-dimensional faces are depicted by the red
arrows while the normal cones of all zero-dimensional faces are depicted by the
red conic regions. The grey area corresponds to $F_{AD} + N(F_{AD})$.

Fig 2: An illustration of the difference between projection and restriction, where both
$\xi$ and $\theta$ are one dimensional. The restriction of $Q$ on $\theta$ when $\xi = 0$ is depicted
by the red line segment in the figure on the left while the projection on $\theta$ is
marked by the red line segment in the figure on the right. This example is taken
from Balas (2005).

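Now we introduce the concept of a projected polyhedron. Consider a polyhedron
of a higher dimension

$$ Q := \{(\xi, \theta) \in \mathbb{R}^{p+n} : A\xi + B\theta \le c\}, \tag{14} $$

where $A = [a_1, \ldots, a_m]^\top \in \mathbb{R}^{m \times p}$, $B = [b_1, \ldots, b_m]^\top \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^m$.
The projection of $Q$ onto the subspace of $\theta$ is defined as

$$ \mathrm{Proj}_\theta(Q) = \{\theta : \exists\, \xi \in \mathbb{R}^p \text{ such that } (\xi, \theta) \in Q\}, \tag{15} $$

which is also a polyhedron. We also note that although $\mathrm{Proj}_\theta(Q)$ is a polyhedron,
it is usually not easy to express it explicitly as a set of inequalities as in (6).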
In addition to the projected polyhedron, we also introduce the restricted
polyhedron as follows. The restriction of $Q$ on the subspace of $\theta$ at point $\xi$ is defined
as

$$ R_\xi(Q) := \{\theta \in \mathbb{R}^n : (\xi, \theta) \in Q\}, \tag{16} $$

which is also a polyhedron. When $\xi = 0$, we will omit $\xi$ in the subscript and
denote the restriction of $Q$ at the point $0$ as $R(Q)$. Note that the restriction of
a polyhedron is not necessarily the same as the projection of it even when $\xi = 0$
(see Figure 2 for an example).
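The difference between projection and restriction can be checked on a small example with $p = n = 1$. The sketch below represents $Q$ by rows $(a, b, c)$ meaning $a\xi + b\theta \le c$, computes the restriction by substituting $\xi$, and computes the projection by Fourier-Motzkin elimination of $\xi$, which is a standard technique that we use here for illustration (the paper does not prescribe it). The particular triangle $Q$ and all function names are hypothetical.

```python
from fractions import Fraction


def interval_from_rows(rows):
    """Solve {theta : b * theta <= c for all (b, c)} as (lower, upper)."""
    lower, upper = None, None
    for b, c in rows:
        if b > 0:
            bound = Fraction(c) / b
            upper = bound if upper is None else min(upper, bound)
        elif b < 0:
            bound = Fraction(c) / b
            lower = bound if lower is None else max(lower, bound)
        elif c < 0:
            raise ValueError("empty set")
    return lower, upper


def restriction(rows, xi):
    """Restriction at xi: substitute xi into a*xi + b*theta <= c."""
    return interval_from_rows([(b, c - a * xi) for a, b, c in rows])


def projection(rows):
    """Projection onto theta by Fourier-Motzkin elimination of xi."""
    zero = [(b, c) for a, b, c in rows if a == 0]
    pos = [r for r in rows if r[0] > 0]
    neg = [r for r in rows if r[0] < 0]
    # Positive multiples of a "pos" row and a "neg" row that cancel xi.
    combined = [
        (-an * bp + ap * bn, -an * cp + ap * cn)
        for ap, bp, cp in pos
        for an, bn, cn in neg
    ]
    return interval_from_rows(zero + combined)


# Q = {(xi, theta) : theta >= 0, theta <= 1 + xi, theta <= 3 - xi}
rows = [(0, -1, 0), (-1, 1, 1), (1, 1, 3)]
restricted = restriction(rows, 0)   # interval of theta with (0, theta) in Q
projected = projection(rows)        # interval of theta with some feasible xi
```

For this triangle the restriction at $\xi = 0$ is the interval $[0, 1]$ while the projection is $[0, 2]$, so the two sets differ.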

2.2. Degrees of Freedom 

DF is an important concept in statistical modeling as it provides a quantitative
description of the amount of fitting performed by a given procedure. Despite its
fundamental role in statistics, its behavior is not completely well-understood,
even for widely used estimators.

In this section we review known results and present a few new results on the
DF and the divergence of the projection estimator $\hat{\theta}(y)$ (see (9)) when $C$ is a
convex polyhedron as defined in (6). As shown in the following result, in such a
scenario, the divergence of $\hat{\theta}(y)$ can be calculated as the dimension of the affine
space that $\hat{\theta}(y)$ lies on.

Theorem 2.3. Suppose that the projection estimator $\hat{\theta}(y)$ is defined in (9)
where $C$ is a convex polyhedron defined in (6). The components of $\hat{\theta}(y)$ are almost
differentiable, and $\nabla \hat{\theta}_i$ (the $i$-th entry of $\nabla \hat{\theta}(y)$) is an essentially bounded function,
for $i = 1, \ldots, n$. Let $J_y$ be the set of indices for all the binding constraints of
$\hat{\theta}(y)$, i.e.,

$$ J_y := \{1 \le i \le m : \langle a_i, \hat{\theta}(y) \rangle = b_i\}. \tag{17} $$

Then,

$$ D(y) = n - \mathrm{rank}(A_{J_y}) \tag{18} $$

for almost every (a.e.) $y \in \mathbb{R}^n$, where $A_{J_y}$ is the submatrix of $A$ with rows indexed
by $J_y$. Thus, $\mathrm{df}(\hat{\theta}(y)) = n - E[\mathrm{rank}(A_{J_y})]$.

The (almost) differentiability of the components of $\hat{\theta}(y)$ and the boundedness
of $\nabla \hat{\theta}_i$ directly follow from the proof of Proposition 1 in Meyer and Woodroofe
(2000). The divergence of $\hat{\theta}(y)$ in (18) is a direct consequence of the following
lemma given as Lemma 2 in Tibshirani and Taylor (2012).

Lemma 2.4 (Lemma 2 in Tibshirani and Taylor (2012)). For a.e. $y \in \mathbb{R}^n$, there
is a neighborhood $U$ of $y$, such that for every $z \in U$,

$$ \hat{\theta}(z) = P_C(z) = P_H(z) = \arg\min_{\theta} \|\theta - z\|_2 \quad \text{s.t. } A_{J_y}\theta = b_{J_y}, \tag{19} $$

where $H = \{\theta : A_{J_y}\theta = b_{J_y}\}$ is an affine space and $J_y$ is defined in (17).

From Lemma 2.1, it is clear that $H = \mathrm{aff}(F)$, where the face $F$, which is
represented as in (12) with $J = J_y$, is the minimal face containing $\hat{\theta}(y)$. Intuitively,
the above result says that $P_C$ is locally a projection onto an affine space for a.e. $y$.
With Lemma 2.4 in place, Theorem 2.3 can be proved easily by observing that
$D(y) = \dim(H) = n - \mathrm{rank}(A_{J_y})$. For the sake of completeness, we present a
proof of Theorem 2.3 based on Lemma 2.4 in the appendix. We also refer readers
to Section 2.2 in Tibshirani and Taylor (2012) for more details.
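Formula (18) can be evaluated directly once a fit is available. The following minimal sketch finds the binding constraints of a given $\hat{\theta}$ up to a numerical tolerance and computes $n - \mathrm{rank}(A_{J_y})$ with exact Gaussian elimination. The tolerance, the function names and the example fit are assumptions for illustration; the fit is supplied by hand rather than computed by a projection routine.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def divergence(A, b, theta_hat, tol=1e-9):
    """n - rank(A_J), where J indexes the constraints binding at theta_hat."""
    n = len(theta_hat)
    binding = [
        row for row, bi in zip(A, b)
        if abs(sum(a * t for a, t in zip(row, theta_hat)) - bi) <= tol
    ]
    return n - rank(binding)


# Monotone cone for n = 4: theta_i - theta_{i+1} <= 0 for i = 1, 2, 3.
A = [[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]]
b = [0, 0, 0]
D = divergence(A, b, [1.0, 2.5, 2.5, 4.0])
```

Only the second constraint is binding for this fit, so the rank is 1 and the divergence is 3, matching the number of distinct fitted values.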

As a special case of Theorem 2.3, when $C$ is a convex cone in (11), the divergence
of $\hat{\theta}(y)$ has been derived in Meyer and Woodroofe (2000). This special case finds
several applications in univariate shape-restricted regression problems as shown
below.

Example 2.5 (One-dimensional isotonic regression). In one-dimensional isotonic
regression (see e.g., Robertson, Wright and Dykstra (1988, Chapter 1)),
the polyhedral convex cone under consideration is the (nondecreasing) monotone
cone $\mathcal{M}$ as defined in (5). From the discussion following Proposition 1 of Meyer
and Woodroofe (2000) it follows that $D(y)$ equals the number of distinct values
of $\hat{\theta}_1, \ldots, \hat{\theta}_n$.
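For the univariate case, the fit and its divergence can be computed together. The sketch below uses the pool adjacent violators algorithm (PAVA), a standard method for projecting onto the monotone cone that is not described in this section, and returns the number of blocks, which equals the number of distinct fitted values. The function name and the toy input are ours.

```python
def pava(y):
    """Project y onto {theta_1 <= ... <= theta_n}; also count distinct values.

    Returns (fit, number_of_blocks). Blocks are merged whenever the mean of
    the previous block is >= the mean of the last block, so the surviving
    block means are strictly increasing.
    """
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([float(v), 1])
        # Compare means by cross-multiplication (counts are positive).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit, len(blocks)


fit, D = pava([1, 3, 2, 4])
```

For the input `[1, 3, 2, 4]` the middle two observations are pooled, giving the fit `[1.0, 2.5, 2.5, 4.0]` with three distinct values, so $D(y) = 3$.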

Now, we utilize the special case of Theorem 2.3 when $C$ is a polyhedral cone
(or equivalently, Proposition 1 from Meyer and Woodroofe (2000)) to derive the
DF for univariate convex regression.

Example 2.6 (Univariate convex regression). Consider the regression model (4)
where now we assume that the regression function $f : \mathbb{R} \to \mathbb{R}$ is convex. The
convex regression model can be expressed in the sequence form as (1) with the
constraint set $C$ in (7). The set $C$ is a convex polyhedral cone, which can be
represented in the form of (11) with $m = n - 2$. In particular, each row $a_i$ is
a sparse vector with only three non-zero elements: $a_{i,i} = x_{i+1} - x_{i+2}$, $a_{i,i+1} =
x_{i+2} - x_i$ and $a_{i,i+2} = x_i - x_{i+1}$. The divergence of $\hat{\theta}(y)$ for univariate convex
regression can be easily calculated according to the following proposition.

Proposition 2.7. Let $0 \le s \le n - 2$ denote the number of changes of slope of
the fit $\hat{\theta}(y)$. Then, $D(y) = s + 2$ for a.e. $y \in \mathbb{R}^n$.
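The rows $a_i$ of Example 2.6 and the count in Proposition 2.7 can be written down in a few lines. In the sketch below a change of slope at an interior design point corresponds to a strictly non-binding constraint $\langle a_i, \hat{\theta} \rangle < 0$; the tolerance, the function names and the hand-picked fits are assumptions for illustration (the fits are not computed from data here).

```python
def convex_rows(x):
    """Rows a_i of A in (11) for univariate convex regression, i = 1..n-2."""
    n = len(x)
    rows = []
    for i in range(n - 2):
        row = [0.0] * n
        row[i] = x[i + 1] - x[i + 2]
        row[i + 1] = x[i + 2] - x[i]
        row[i + 2] = x[i] - x[i + 1]
        rows.append(row)
    return rows


def convex_divergence(x, theta_hat, tol=1e-9):
    """s + 2, where s counts the constraints that are strictly non-binding."""
    s = 0
    for row in convex_rows(x):
        value = sum(a * t for a, t in zip(row, theta_hat))
        if value < -tol:
            s += 1
    return s + 2


x = [0.0, 1.0, 2.0, 3.0]
D_kink = convex_divergence(x, [0.0, 0.0, 1.0, 2.0])   # one change of slope
D_line = convex_divergence(x, [0.0, 1.0, 2.0, 3.0])   # no change of slope
```

A piecewise linear fit with one kink gives $D(y) = 3$, while a straight line gives $D(y) = 2$, the DF of simple linear regression.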
3. Bounded Isotonic Regression

It is well-known that the projection $\hat{\theta}(y)$ of $y$ onto the isotonic cone $\mathcal{M}$ (see (5)),
or its multivariate analogue (to be described later in detail in this section),
suffers from the spiking effect, i.e., over-fitting, especially towards the boundary
of the convex hull of the predictor(s) (see Pal (2008) and Woodroofe and Sun
(1993)). However, such monotonic relationships among variables arise naturally
in many applications and this has led to a recent upsurge of interest in regularized
isotonic regression; see e.g., Wu, Meyer and Opsomer (2015), Luss, Rosset
and Shahar (2012), and Luss and Rosset (2014). Probably the most natural form
of regularization involves constraining the range of $\hat{\theta}(y)$, i.e., $\max_j \hat{\theta}_j - \min_j \hat{\theta}_j$.
This leads to bounded isotonic regression. Thus, the univariate bounded isotonic
regression can be represented as in (1) with the constraint set

$$ C = \{\theta : \theta_1 \le \cdots \le \theta_n, \text{ and } \theta_n - \theta_1 \le \lambda\}, \tag{20} $$

for some fixed $\lambda > 0$. Let us first see how to characterize the divergence of the
projection estimator onto $C$ in (20).

First we note that the set $C$ in (20) is a convex polyhedron rather than a
polyhedral cone due to the additional boundedness constraint $\theta_n - \theta_1 \le \lambda$. In
particular, the set $C$ can be represented in the form of (6) with $A \in \mathbb{R}^{n \times n}$ and
$b \in \mathbb{R}^n$. Each row $a_i$, for $1 \le i \le n - 1$, has $+1$ and $-1$ in the $i$-th and $(i+1)$-th
positions with the remaining entries being zeros; while the last row $a_n$ has $+1$
and $-1$ in the $n$-th and 1st positions with the remaining entries being zeros (see
an example of $A$ for $n = 5$ in Figure 3(a)). The vector $b$ only has one non-zero
element at the $n$-th position, i.e., $b_n = \lambda$.
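The matrix $A$ and vector $b$ just described can be assembled as follows. This is a minimal sketch with a hypothetical function name; rows and columns are stored with Python's 0-based indexing, so row `i` corresponds to the constraint with index $i + 1$ in the text.

```python
def bounded_isotonic_constraints(n, lam):
    """Return (A, b) such that C in (20) equals {theta : A theta <= b}.

    Assumes n >= 2. Row i (0-based, i < n - 1) encodes
    theta_{i+1} - theta_{i+2} <= 0; the last row encodes
    theta_n - theta_1 <= lam.
    """
    A = []
    for i in range(n - 1):
        row = [0] * n
        row[i], row[i + 1] = 1, -1
        A.append(row)
    last = [0] * n
    last[n - 1], last[0] = 1, -1
    A.append(last)
    b = [0] * (n - 1) + [lam]
    return A, b


A, b = bounded_isotonic_constraints(5, 2.0)
```

Every row of `A` contains exactly one $+1$ and one $-1$, which is the structure exploited in the graph interpretation that follows.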

An interesting observation that we make here is that the matrix $A$ is the
incidence matrix¹ of the graph $G$ defined as follows (we say that $G$ is induced
from $A$). $G$ has the vertex set of cardinality $n$ corresponding to $\{\theta_i\}_{i=1}^{n}$, i.e.,
$V(G) = \{1, \ldots, n\}$. The edge set contains $n$ edges: for $1 \le i \le n - 1$, there is an
edge that runs from node $\theta_i$ to $\theta_{i+1}$ and the $n$-th edge runs from $\theta_n$ to $\theta_1$, i.e.,
$E(G) = \bigcup_{i=1}^{n-1} \{i \to i+1\} \cup \{n \to 1\}$. An example of the graph $G$ induced from
the matrix $A$ when $n = 5$ is shown in Figure 3(b).

For a given $G$, let $\omega(G)$ denote the number of connected components of the
undirected version of the graph $G$ (removing the directions of edges in $G$), i.e., the
number of maximal connected subgraphs of $G$. For example, for the graph $G$ in
Figure 3(b), $\omega(G) = 1$. With this notation in place and utilizing Theorem 2.3,
we characterize the divergence of the projection estimator for univariate bounded
isotonic regression in the following proposition.

¹The incidence matrix of a directed graph has one column corresponding to each node of the
graph and one row for each edge of the graph. If an edge runs from node a to node b, the row
corresponding to that edge has $+1$ in column a and $-1$ in column b.
Fig 4: The matrix $A_{J_y}$ (panel (a)) and the induced graph $G_{J_y}$.


Proposition 3.1. Let $\hat{\theta}_\lambda(y)$ be the projection estimator onto the set $C$ defined
in (20), which can be represented as $C = \{\theta \in \mathbb{R}^n : A\theta \le b\}$ for appropriate choices
of $A$ and $b$. Let $G$ be the graph induced from $A$, $J_y := \{1 \le i \le n : \langle a_i, \hat{\theta}_\lambda(y) \rangle =
b_i\}$, and $G_{J_y}$ be the subgraph of $G$ induced by $A_{J_y}$. The divergence of $\hat{\theta}_\lambda(y)$ is
the number of connected components of $G_{J_y}$ for a.e. $y$, i.e., $D(y) = \omega(G_{J_y})$, and
therefore $\mathrm{df}(\hat{\theta}_\lambda(y)) = E[\omega(G_{J_y})]$.

The characterization of divergence in Proposition 3.1 not only has interesting
connections to graph theory but also leads to a computational advantage. In
fact, it is straightforward to compute $\omega(G_{J_y})$ using either breadth-first search
or depth-first search in linear time in $n$, which is computationally cheaper than
directly calculating the rank of $A_{J_y}$ in Theorem 2.3.

Example 3.2. Let us work out the conclusion of Proposition 3.1 for a toy example
with $n = 5$. Suppose that we have $\hat{\theta}_{\lambda,1} = \hat{\theta}_{\lambda,2} < \hat{\theta}_{\lambda,3} = \hat{\theta}_{\lambda,4} < \hat{\theta}_{\lambda,5}$ and
$\hat{\theta}_{\lambda,5} = \hat{\theta}_{\lambda,1} + \lambda$. Then $J_y = \{1, 3, 5\}$ and the corresponding $A_{J_y}$ and $G_{J_y}$ are
presented in Figure 4. From Figure 4, $G_{J_y}$ has 2 connected components $\{1, 2, 5\}$ and
$\{3, 4\}$ and thus $D(y) = \omega(G_{J_y}) = 2$. It is of interest to compare this with the
univariate (unbounded) isotonic regression example (see Example 2.5) where the
divergence of $\hat{\theta}(y)$ would be 3 (i.e., the number of distinct values of the $\hat{\theta}_i$'s) instead
of 2.
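The count in Example 3.2 can be reproduced with a short routine. The sketch below counts connected components with a union-find structure, which we use in place of the breadth-first or depth-first search mentioned above because it is compact; both approaches give the same count. The function name and the 1-based edge dictionary are ours.

```python
def count_components(n, edges):
    """Number of connected components of an undirected graph on nodes 1..n."""
    parent = list(range(n + 1))  # index 0 is unused

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components


n = 5
# Edge k joins k and k+1 for k < n; edge n joins n and 1 (the range constraint).
all_edges = {k: (k, k + 1) for k in range(1, n)}
all_edges[n] = (n, 1)

J_y = [1, 3, 5]                      # binding constraints in Example 3.2
D_bounded = count_components(n, [all_edges[k] for k in J_y])
D_unbounded = count_components(n, [all_edges[k] for k in [1, 3]])
```

With the range constraint binding the components are $\{1, 2, 5\}$ and $\{3, 4\}$, giving 2; dropping edge 5, as in the unbounded problem, gives 3.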

Remark 3.1. For univariate (unbounded) isotonic regression, the result in Example 2.5,
which shows that the divergence of $\hat{\theta}(y)$ is the number of distinct values
of the $\hat{\theta}_i$'s, can be viewed as a simple consequence of Proposition 3.1. To see this,
suppose that there are $s$ distinct values of the $\hat{\theta}_i$'s and let $1 \le r_1 < \cdots < r_{s-1} \le n - 1$
be the indices at which the fit strictly increases, i.e., $\hat{\theta}_{r_k} < \hat{\theta}_{r_k + 1}$ for $k = 1, \ldots, s - 1$.
Then $J_y = \{1, \ldots, n - 1\} \setminus \{r_1, \ldots, r_{s-1}\}$,
and the corresponding $G_{J_y}$ has $s$ connected components:

$$ \{1, \ldots, r_1\},\ \{r_1 + 1, \ldots, r_2\},\ \ldots,\ \{r_{s-1} + 1, \ldots, n\}. $$

By Proposition 3.1, the divergence of $\hat{\theta}(y)$ is $\omega(G_{J_y}) = s$, which equals the
number of distinct values of the $\hat{\theta}_i$'s.

We now extend our analysis to isotonic regression on any partially ordered set;
see e.g., Robertson, Wright and Dykstra (1988, Chapter 1). Let $\mathcal{X} := \{x_1, \ldots, x_n\}$
be a set (with $n$ distinct elements) in a metric space with a partial order, i.e.,
there exists a binary relation $\preceq$ that is reflexive ($x \preceq x$ for all $x \in \mathcal{X}$), transitive
($u, v, w \in \mathcal{X}$, $u \preceq v$ and $v \preceq w$ imply $u \preceq w$), and antisymmetric ($u, v \in \mathcal{X}$, $u \preceq
v$ and $v \preceq u$ imply $u = v$). Consider (4) where now the real-valued function
$f$ is assumed to be isotonic with respect to the partial order $\preceq$, i.e., for any pair
$u, v \in \mathcal{X}$, $u \preceq v$ implies $f(u) \le f(v)$. We further assume a boundedness constraint
on $f$ of the form $\max_{x \in \mathcal{X}} f(x) - \min_{x \in \mathcal{X}} f(x) \le \lambda$, for $\lambda > 0$. A commonly
studied special case of this model is (bounded) bivariate isotonic regression where
$\mathcal{X} = \{(a, b)\}_{1 \le a, b \le q}$ has the partial order $(a, b) \preceq (a', b')$ if and only if $a \le a'$
and $b \le b'$. Letting $\theta_{ab} = f((a, b))$, the boundedness constraint translates to
$\theta_{qq} - \theta_{11} \le \lambda$.

This model (i.e., bounded isotonic regression on a partially ordered set) can be
expressed in the sequence form as (1) where the isotonic constraints on $\theta$ are of
the form $\theta_i \le \theta_j$ if $x_i \preceq x_j$, for some $i, j \in \{1, \ldots, n\}$. The induced graph from
these isotonic constraints is denoted by $G = (V, E)$ where $V = \{1, \ldots, n\}$ and
the set of directed edges is $E = \{(i, j) : x_i \preceq x_j\}$. It is easy to see that $G$ is
an acyclic directed graph. When the range of $f$ is known to be bounded (from
above) by some $\lambda > 0$, we can impose this boundedness restriction on $f$ by adding
the following constraints. For any node $i$, we define

$$ n^+(i) := \{j \in V : (i, j) \in E\} $$

to be the set of elements that are "greater than $i$" with respect to the partial
order (i.e., successors of $i$), and

$$ n^-(i) := \{j \in V : (j, i) \in E\} $$

to be the set of elements that are "smaller than $i$" (i.e., predecessors of $i$). The
maximal and minimal sets of $V$ with respect to this partial order are

$$ \max(V) = \{i \in V : n^+(i) = \emptyset\} \quad \text{and} \quad \min(V) = \{i \in V : n^-(i) = \emptyset\}. $$
For each $i \in \min(V)$ and $j \in \max(V)$, we add a constraint $\theta_j \le \theta_i + \lambda$ to impose the boundedness restriction on the range of $f$. Thus, the constraint set for bounded isotonic regression takes the following form:

$$C := \{\theta \in \mathbb{R}^n : \theta_i \le \theta_j,\ \forall\,(i,j) \in E;\ \ \theta_i \le \theta_j + \lambda,\ i \in \max(V),\ j \in \min(V)\}, \tag{21}$$

which can be easily represented as a convex polyhedron of the form (6). The set $C$ in (21) is a generalization of (20) for univariate bounded isotonic regression where $E = \{(i,i+1) : i = 1,\ldots,n-1\}$, $\min(V) = \{1\}$ and $\max(V) = \{n\}$.
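To make the construction of (21) concrete, here is a small sketch for the bivariate grid example. It is an illustration only: the function names are hypothetical, nodes are labelled by their grid coordinates rather than by $1,\ldots,n$, and self-loops $(i,i)$ are left out of the edge set (an assumption made here so that the graph is acyclic).

```python
from itertools import product


def grid_order_graph(q):
    """Nodes (a, b) with 1 <= a, b <= q; edge (u, v) when u <= v componentwise."""
    nodes = list(product(range(1, q + 1), repeat=2))
    edges = [(u, v) for u in nodes for v in nodes
             if u != v and u[0] <= v[0] and u[1] <= v[1]]
    return nodes, edges


def extremal_sets(nodes, edges):
    """max(V): nodes without successors; min(V): nodes without predecessors."""
    has_successor = {u for u, _ in edges}
    has_predecessor = {v for _, v in edges}
    maximal = [i for i in nodes if i not in has_successor]
    minimal = [i for i in nodes if i not in has_predecessor]
    return maximal, minimal


def bounded_isotonic_constraints(q, lam):
    """Triples (i, j, rhs) encoding theta_i - theta_j <= rhs, as in (21)."""
    nodes, edges = grid_order_graph(q)
    maximal, minimal = extremal_sets(nodes, edges)
    cons = [(i, j, 0.0) for i, j in edges]                   # isotonic part
    cons += [(i, j, lam) for i in maximal for j in minimal]  # range bound
    return cons


constraints = bounded_isotonic_constraints(3, lam=2.0)
print(len(constraints), constraints[-1])
```

On the grid there is a single maximal node $(q,q)$ and a single minimal node $(1,1)$, so only one range constraint is added, matching $\theta_{qq} - \theta_{11} \le \lambda$.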

We define a directed graph $\tilde G = (V, \tilde E)$ where $A$, obtained from representing $C$ in (21) in the form (6), is the incidence matrix of $\tilde G$. It is straightforward to see that $G$ is a subgraph of $\tilde G$ since $\tilde E = E \cup \{(i,j) : i \in \max(V),\ j \in \min(V)\}$. Using the same proof technique as that of Proposition 3.1 we can derive the following expressions for the divergence and DF of the projection estimator $\hat\theta_\lambda(y) = P_C(y)$ for bounded isotonic regression:

$$D(y) = \omega(\tilde G_{J_y}) \ \text{ for a.e. } y, \quad \text{and} \quad \mathrm{df}(\hat\theta_\lambda(y)) = \mathbb{E}[\omega(\tilde G_{J_y})], \tag{22}$$

where $\tilde G_{J_y}$ is the subgraph of $\tilde G$ induced by $A_{J_y}$ with $J_y$ defined in (17). We also note that for unbounded isotonic regression (i.e., $\lambda = +\infty$) on the partially ordered set $\mathcal{X}$ the characterization of divergence in (22) still holds with $\tilde G$ replaced with $G$ and $A$ being the incidence matrix of $G$.

In the following we show that the divergence $D_\lambda(y)$ in (22) (where we make the dependence on the model complexity parameter $\lambda$ explicit), and thus the DF, is non-decreasing in $\lambda$. To show this we first present an important connection between the solution of bounded isotonic regression and that of unbounded isotonic regression (i.e., $\lambda = +\infty$), which is of independent interest.

We start with some notation. It is well known that the LSE for unbounded isotonic regression $\hat\theta$ has a group-constant structure (here $y$ is suppressed for notational simplicity). That is, there exists a partition $U_1, U_2, \ldots, U_r$ of $V = \{1,\ldots,n\}$ (i.e., the $U_s$'s are disjoint and $V = \bigcup_{s=1}^{r} U_s$) such that $\hat\theta_i = \theta_s$ for each $i \in U_s$, for $1 \le s \le r$. Moreover, without loss of generality, we assume that $\theta_1 < \theta_2 < \cdots < \theta_r$. Let $\hat\theta_\lambda$ be the projection estimator (LSE) for bounded isotonic regression with the parameter $\lambda$ (i.e., $\hat\theta_\lambda = \arg\min_{\theta \in C} \|y - \theta\|_2^2$ with $C$ in (21)). The next proposition shows that $\hat\theta_\lambda$ can be obtained by appropriately thresholding $\hat\theta$.

Proposition 3.3. For any $\theta_r - \theta_1 > \lambda > 0$, there exists a unique constant $L_\lambda$, depending on $\lambda$, such that

$$\hat\theta_{\lambda,i} = \max(L_\lambda, \min(L_\lambda + \lambda, \theta_s)), \quad \text{for } i \in U_s. \tag{23}$$

Moreover, $L_\lambda$ is non-increasing in $\lambda$.
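The thresholding in (23) can be sketched for the univariate case as follows. This is an illustration under stated assumptions, not the construction used in the proof: the unbounded fit is computed by PAVA, and the level $L_\lambda$ is located by bisection using the fact that the fitted values must sum to $\sum_i y_i$ (the constraint set in (21) is invariant to adding a constant to $\theta$, so the residual is orthogonal to the all-ones vector). The function names are hypothetical.

```python
def pava(y):
    """Unbounded univariate isotonic least-squares fit (pool-adjacent-violators)."""
    blocks = []  # [sum, count] per constant block
    for v in y:
        blocks.append([float(v), 1])
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]


def clip_fit(fit, level, lam):
    """Apply the thresholding rule max(L, min(L + lam, theta))."""
    return [max(level, min(level + lam, t)) for t in fit]


def bounded_isotonic(y, lam, n_iter=200):
    """Bounded isotonic fit obtained by thresholding the unbounded fit.

    Returns (fit, L); L is None when lam is at least the range of the
    unbounded fit, in which case the bound is inactive.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    fit = pava(y)
    lo, hi = fit[0], fit[-1] - lam
    if hi <= lo:
        return fit, None
    target = sum(fit)  # equals sum(y) for the isotonic fit
    for _ in range(n_iter):  # bisection on the level L
        mid = 0.5 * (lo + hi)
        if sum(clip_fit(fit, mid, lam)) < target:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    return clip_fit(fit, level, lam), level


print(bounded_isotonic([0, 1, 5], lam=2.0))
print(bounded_isotonic([0, 1, 5], lam=1.0))
```

For $y = (0, 1, 5)$ the level moves from about $5/3$ at $\lambda = 1$ to about $4/3$ at $\lambda = 2$, consistent with $L_\lambda$ being non-increasing in $\lambda$.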
The key to the proof of the above result is to find appropriate values of dual variables such that the primal solutions in (23) and dual solutions together satisfy the KKT conditions of $\min_{\theta \in C} \|y - \theta\|_2^2$ with $C$ in (21). We achieve this by designing a transportation problem, which is a classical problem in operations research (see, e.g., Dantzig (1959, Chapter 14)). The dual solutions are constructed based on the solution of such a transportation problem. Please refer to the proof in the appendix for details.

We also note that when $\lambda \ge \theta_r - \theta_1$ the boundedness constraint will be non-effective and $\hat\theta_\lambda = \hat\theta$. Combining Proposition 3.3 with (22) we obtain the following theorem which shows that DF is an appropriate measure of model complexity for bounded isotonic regression.

Theorem 3.4. For any given $y \in \mathbb{R}^n$ the divergence of $\hat\theta_\lambda(y)$ is non-decreasing in $\lambda$. This implies that $\mathrm{df}(\hat\theta_\lambda(y))$ is non-decreasing in $\lambda$.

4. DF under Projected Polyhedral Constraints 

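For some estimation problems the constraint set takes the form of a projection of a higher-dimensional polyhedron. In particular, let $Q$ be a higher-dimensional polyhedron on the product space of the parameters $\theta$ and some auxiliary variables $\xi$, i.e.,

$$Q := \{(\xi,\theta) \in \mathbb{R}^{p+n} : A\xi + B\theta \le c\}, \tag{24}$$

where $A = [a_1,\ldots,a_m]^\top \in \mathbb{R}^{m \times p}$, $B = [b_1,\ldots,b_m]^\top \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^m$. Similar to (12), we call a non-empty subset $F$ of $Q$ a face of $Q$ if there exists $J \subseteq \{1,2,\ldots,m\}$ so that

$$F = \{(\xi,\theta) \in Q : \langle a_i,\xi\rangle + \langle b_i,\theta\rangle = c_i,\ \forall\, i \in J\}. \tag{25}$$

We assume that the constraint set $C$ is the projection of $Q$ onto the parameter space of $\theta$, i.e.,

$$C := \mathrm{Proj}_\theta(Q) = \{\theta \in \mathbb{R}^n : \exists\, \xi \in \mathbb{R}^p \text{ such that } (\xi,\theta) \in Q\}. \tag{26}$$

The projection estimator $\hat\theta(y)$ takes the form

$$\hat\theta(y) = \arg\min_{\theta \in \mathrm{Proj}_\theta(Q)} \|\theta - y\|_2^2. \tag{27}$$

From the definition of $\mathrm{Proj}_\theta(Q)$ in (26), (27) is equivalent to solving the following optimization problem:

$$(\hat\theta(y),\hat\xi(y)) = \arg\min_{\theta,\xi}\ \|\theta - y\|_2^2 \quad \text{s.t. } A\xi + B\theta \le c, \tag{28}$$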
which can be viewed as a partial projection of $(a,y)$ onto $Q$, for an arbitrary $a \in \mathbb{R}^p$. By a partial projection (which is different from the standard projection) of $(a,y)$ onto $Q$ we mean that the solution of (28) is found by only minimizing the distance from $\theta$ to $y$ regardless of the distance from $\xi$ to $a$. Note that although the component $\hat\theta(y)$ is unique, due to the strong convexity of the objective function in $\theta$, the component $\hat\xi(y)$ is not necessarily unique. Given the formulation in (28), one is interested in the DF of $\hat\theta(y)$.

Example 4.1 (Linear regression). One well-known example of (28) is linear regression. In particular, given the response vector $y \in \mathbb{R}^n$ and the design matrix $X \in \mathbb{R}^{n \times d}$, the ordinary LSE is defined as

$$\hat\beta(y) \in \arg\min_{\beta}\ \tfrac12\|y - X\beta\|_2^2. \tag{29}$$

For the purpose of model selection, it is of great interest to compute the DF of $X\hat\beta(y)$, namely, $\mathrm{df}(X\hat\beta(y))$. By setting $\xi = \beta$ and $\theta = X\beta$, (29) can be reformulated as a special case of (28), i.e.,

$$(\hat\theta(y),\hat\xi(y)) \in \arg\min_{\theta,\xi}\ \tfrac12\|\theta - y\|_2^2 \quad \text{s.t. } \theta = X\xi. \tag{30}$$

Example 4.2 (Multivariate convex regression). Another example of (28) is multivariate convex regression (see e.g., Seijo and Sen (2011)) which can be expressed as (4) where $f : \mathbb{R}^d \to \mathbb{R}$ ($d \ge 1$) is a convex function and $\mathcal{X} := \{x_1,\ldots,x_n\}$ is the set of design points (with $n$ distinct elements) in $\mathbb{R}^d$. Letting $\theta^* = (f(x_1),\ldots,f(x_n))$, by the convexity of $f$, $\theta^*$ belongs to a constraint set $C$, which is characterized in the following lemma; see e.g., Kuosmanen (2008), Seijo and Sen (2011), Hannah and Dunson (2011), Lim and Glynn (2012).

Lemma 4.3. Consider the multivariate convex regression example discussed above. Let $\theta \in \mathbb{R}^n$. Then $\theta \in C$ iff there exists a set of $n$ $d$-dimensional vectors $\xi_1,\ldots,\xi_n \in \mathbb{R}^d$ such that the following inequalities hold simultaneously:

$$\langle \xi_j, x_k - x_j\rangle \le \theta_k - \theta_j, \quad \text{for all } j \ne k \in \{1,\ldots,n\}. \tag{31}$$

The characterization of $C$ in Lemma 4.3 is quite intuitive: since $f$ is a multivariate convex function, we have for any pair $x_k, x_j \in \mathcal{X}$,

$$f(x_k) - f(x_j) \ge \langle g(x_j), x_k - x_j\rangle, \tag{32}$$

where $g(x_j) \in \partial f(x_j)$ is a subgradient of the convex function $f$ at $x_j$. Letting $\xi_j = g(x_j)$, one can easily see the equivalence between (32) and (31). Further, let $\xi = [\xi_1^\top, \ldots, \xi_n^\top]^\top \in \mathbb{R}^{nd}$. The set of $n(n-1)$ inequalities in the dual representation in (31) can be represented as polyhedral constraints on $(\xi,\theta)$ as in (24) with $p = nd$ and $c = 0$. Therefore, multivariate convex regression is a special case of the optimization problem described in (28).


4.1. Linearly Perturbed Partial Projection


Instead of establishing the DF of $\hat\theta(y)$ in (28), we further generalize the objective function in (28) to include a linear term of $\xi$ to encompass more statistical applications (e.g., Lasso and generalized Lasso):

$$(\hat\theta(y),\hat\xi(y)) \in \arg\min_{\theta,\xi}\ \tfrac12\|\theta - y\|_2^2 + d^\top\xi \quad \text{s.t. } A\xi + B\theta \le c, \tag{33}$$

where $d \in \mathbb{R}^p$ is a given vector. We call the problem (33) a linearly perturbed partial projection of $(a,y) \in \mathbb{R}^{p+n}$, for an arbitrary $a \in \mathbb{R}^p$, due to the linear term $d^\top\xi$. Note that the optimization problem (33) reduces to (28) when $d = 0$. As in (28), the component $\hat\xi(y)$ is not necessarily unique. When there exist multiple $\hat\xi(y)$'s satisfying (33), $\hat\xi(y)$ can be chosen to be any one of them.


Example 4.4 (Lasso and generalized Lasso). The generalized Lasso can be formulated as the following optimization problem (Tibshirani and Taylor, 2011, 2012):

$$\hat\beta(y) \in \arg\min_{\beta}\ \tfrac12\|y - X\beta\|_2^2 + \tau\|D\beta\|_1, \tag{34}$$

where $D = [d_1, d_2, \ldots, d_l]^\top$ is a given $l \times d$ matrix. When $D = I_d$, it reduces to the standard Lasso problem. To see why (34) is a special case of (33), note that (34) can be re-written as

$$(\hat\beta(y),\hat\gamma(y)) \in \arg\min_{\beta,\gamma\,:\,-\gamma \le D\beta \le \gamma}\ \tfrac12\|y - X\beta\|_2^2 + \tau \mathbf{1}^\top\gamma. \tag{35}$$

Letting $\theta = X\beta$, the formulation in (35) is further equivalent to

$$(\hat\theta(y),\hat\beta(y),\hat\gamma(y)) \in \arg\min_{\theta,\beta,\gamma}\ \tfrac12\|\theta - y\|_2^2 + \tau\mathbf{1}^\top\gamma \tag{36}$$
$$\text{s.t.}\quad X\beta - \theta \le 0,\quad -X\beta + \theta \le 0,\quad D\beta - \gamma \le 0,\quad -D\beta - \gamma \le 0.$$

Observe that the optimization problem in (36) is a special case of (33) by setting $\xi = (\beta^\top,\gamma^\top)^\top$. Tibshirani and Taylor (2012) computed the DF of $X\hat\beta(y)$. In this section we will show that the result of Tibshirani and Taylor (2012) can be obtained as a direct consequence of our theorem on DF for $\hat\theta(y)$ in the general framework of (33).

When $d \ne 0$, the optimization problem (33) may have an unbounded optimal value depending on $d$. The following result gives the necessary and sufficient condition for (33) to be bounded.

Lemma 4.5. The optimization problem in (33) has a bounded optimal value if and only if $-d = A^\top\lambda$ for some $\lambda \ge 0$.

The proof of Lemma 4.5 is based on Farkas's lemma (see e.g., Rockafellar (1970, Corollary 22.3.1)) and is provided in the appendix. Based on the above lemma, for the rest of the paper, we will assume that $-d = A^\top\lambda$ for some $\lambda \ge 0$ so that (33) is bounded. When $d = 0$, such an assumption trivially holds. Given the optimization problem in (33), the divergence and DF of $\hat\theta(y)$ can be characterized by the following theorem.

Theorem 4.6. The optimal solution $\hat\theta(y)$ to the optimization problem in (33) is unique for each $y$. The components of $\hat\theta(y)$ are almost differentiable, and $\nabla\hat\theta_i$ is an essentially bounded function, for each $i = 1,\ldots,n$. Let

$$J_y := \{1 \le i \le m : \langle a_i,\hat\xi(y)\rangle + \langle b_i,\hat\theta(y)\rangle = c_i\}. \tag{37}$$

Further, let $I_y \subseteq J_y$ be the index set of maximal independent rows of the matrix $[A_{J_y}, B_{J_y}]$. Thus the vectors $\{[a_i^\top, b_i^\top] : i \in I_y\}$ are linearly independent. We have

$$D(y) = n - |I_y| + \mathrm{rank}(A_{I_y})$$

for a.e. $y \in \mathbb{R}^n$, where $A_{I_y}$ and $B_{I_y}$ are the submatrices of $A$ and $B$ with rows in the set $I_y$ and $|I_y|$ is the cardinality of the set $I_y$. Therefore,

$$\mathrm{df}(\hat\theta(y)) = n - \mathbb{E}[|I_y|] + \mathbb{E}[\mathrm{rank}(A_{I_y})].$$

As a simple validity check of Theorem 4.6, since $B_{I_y}$ only has $n$ columns, we have

$$|I_y| = \mathrm{rank}([A_{I_y}, B_{I_y}]) \le n + \mathrm{rank}(A_{I_y}),$$

which implies that $D(y) \ge 0$. We also note that, although $\hat\theta(y)$ is unique for any $y$, $\hat\xi(y)$ is not unique for some $y$ so that the index sets $J_y$ and $I_y$ defined in Theorem 4.6 are not necessarily unique. However, for a fixed $y$, $D(y)$ is unique so that these different $I_y$'s must lead to the same value of $n - |I_y| + \mathrm{rank}(A_{I_y})$. To prove Theorem 4.6, we first prove a generalization of Lemma 2.4.

Lemma 4.7. Let the index set $J_y$ be as defined in (37). For a.e. $y \in \mathbb{R}^n$,

$$\hat\theta(z) = \tilde\theta(z), \quad \text{for any } z \text{ in a neighborhood } U \text{ of } y, \tag{38}$$

where $\hat\theta(z)$ is defined in (33) and $\tilde\theta(z)$ is defined as the $\theta$-component of the optimal solution of the following optimization problem:

$$(\tilde\theta(z),\tilde\xi(z)) = \arg\min_{\theta,\xi}\ \tfrac12\|\theta - z\|_2^2 + d^\top\xi \quad \text{s.t. } A_{J_y}\xi + B_{J_y}\theta = c_{J_y}. \tag{39}$$

We first note that from the definition of $I_y$ in Theorem 4.6, the constraint in (39) is equivalent to $A_{I_y}\xi + B_{I_y}\theta = c_{I_y}$. Let

$$H = \{(\xi,\theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}\}. \tag{40}$$

From Lemma 2.1, we know that $H = \mathrm{aff}(F)$, where the face $F$, which is represented in (25) with $J = J_y$, is the minimal face containing $(\hat\theta(y),\hat\xi(y))$. Similar to Lemma 2.4, Lemma 4.7 states that a linearly perturbed partial projection onto $Q$ is locally equivalent to a linearly perturbed partial projection onto an affine space $H$ for a.e. $y$. Thus, for a.e. $y$, we can change the domain of (33) from $Q$ to $H$ without changing the value of $\hat\theta(y)$. With the domain being $H$, which is in the form of a system of equations, the divergence of $\hat\theta(y)$ can then be characterized using the KKT conditions of (39); see the proof of Theorem 4.6 in the appendix.

Despite the similarity between Lemma 4.7 and Lemma 2.4, the proof of Lemma 4.7 is technically more challenging due to the complex objective function and constraints in (33). Under the setting of Lemma 2.4, $y$ corresponds to a unique $J_y$ defined by (17). However, under the setting of Lemma 4.7, because the component $\hat\xi(y)$ of the solution of (33) is not always unique, the set of indices $J_y$ in (37) can vary with $\hat\xi(y)$. In other words, some $y$ may correspond to multiple $J_y$'s. It is worth highlighting that the local equivalence in (38) depends not only on $y$ but also on $J_y$, which appears in the constraint set of (39). The challenge in proving Lemma 4.7 is to first identify the set of points $y$ for which (38) does not hold for at least one $J_y$, and then to show that this set has measure zero. In fact, we observe that such a $y$ can only appear in the set

$$\bigcup_{F \text{ is a face of } Q} \mathrm{bd}\big(\mathrm{Proj}_\theta(F) + R_{-d}(N(F))\big), \tag{41}$$

where $N(F)$ and $R_{-d}(N(F))$ are defined in (13) and (16) respectively. Since the boundary of a convex polyhedron has measure zero and $Q$ has finitely many faces, the set in (41) is a measure zero set. Because (38) holds for any $y$ which does not belong to (41), the conclusion of Lemma 4.7 follows.

We use the graphical illustrations in Figure 5 to show why any $y$ that does not satisfy (38) is contained in the set (41). This also highlights the main ideas behind the proof of Lemma 4.7. In Figure 5 (which needs to be seen in color), we consider a simple polyhedron in $\mathbb{R}^3$ with $\xi$ of dimension one and $\theta$ of dimension two. The eight vertices of this polyhedron are indexed by $A, B, \ldots, H$ as in Figure 5(a). Each face is indexed by its vertices, e.g., $F_{ABCD}$ stands for the face covering the top of this polyhedron.

Fig 5: Illustration of Lemma 4.7.

We start with the simple setting by assuming $d = 0$ and then present one instance of $y \in \mathbb{R}^2$ that satisfies (38). Suppose that $y \in \mathbb{R}^2$ is an interior point in the pink region in Figure 5(b) (e.g., the blue point). The solution $(\hat\theta(y),\hat\xi(y))$ of (28) is unique and is marked by the red point on $F_{BC}$, and $\hat\theta(y)$ is marked by the black point. In this case, $J_y$ contains the inequalities which define both $F_{ABCD}$ and $F_{BCEF}$. By the definition (39), $(\tilde\theta(z),\tilde\xi(z))$ is the partial projection of $z$ onto the affine space $H = \mathrm{aff}(F_{BC})$, which is identical to $(\hat\theta(z),\hat\xi(z))$ for all $z$ in a neighborhood of $y$ so that (38) holds. Next we consider three representative cases where $y \in \mathbb{R}^2$ violates (38) and show that such a $y$ belongs to (41).
Unique $\hat\xi(y)$ with $d = 0$: In Figure 5(b), suppose that $y$ lies on the boundary of the pink region, e.g., the red line segment in Figure 5(b). Then, (38) does not hold for such a $y$. In fact, the solution $(\hat\theta(y),\hat\xi(y))$ of (28) is unique and lies on $F_{BC}$, and $J_y$ contains the inequalities that define $F_{ABCD}$ and $F_{BCEF}$. By (39), $(\tilde\theta(z),\tilde\xi(z))$ is the partial projection of $z$ onto the affine space $H = \mathrm{aff}(F_{BC})$. However, for any neighborhood of $y$, we can find a $z$ such that $(\hat\theta(z),\hat\xi(z)) \notin F_{BC}$ so that (38) does not hold.

Next, we claim that the boundary of the pink region is contained in the set (41). In fact, we observe that the normal cone $N(F_{BC})$ is the green cone attached to the red point, the restriction $R(N(F_{BC}))$ is the green arrow, and the pink region is identified as $\mathrm{Proj}_\theta(F_{BC}) + R(N(F_{BC}))$. This implies that the boundary of the pink region is $\mathrm{bd}(\mathrm{Proj}_\theta(F_{BC}) + R(N(F_{BC})))$, which is contained in the set (41).

Non-unique $\hat\xi(y)$ with $d = 0$: Suppose that $y$ lies on the boundary of the pink region in Figure 5(c), e.g., the blue point in the red line segment. Then, (38) does not hold for such a $y$. In fact, the component $\hat\theta(y)$ is unique and identical to $y$ while the component $\hat\xi(y)$ is not unique. In particular, the red dotted line in Figure 5(c) represents all $(\theta,\xi)$ with $\theta = y$, the green point is the intersection point between the red dotted line and $F_{BCEF}$, and the red point is the intersection point between the red dotted line and $F_{AD}$. Any point in the red dotted line between the green and the red point corresponds to a solution $(\hat\theta(y),\hat\xi(y))$ of (28).

If $(\hat\theta(y),\hat\xi(y))$ is chosen to be the red point, $J_y$ contains the inequalities which define $F_{ABCD}$ and $F_{ADHG}$. By (39), $(\tilde\theta(z),\tilde\xi(z))$ is the partial projection of $z$ onto the affine space $H = \mathrm{aff}(F_{AD})$. However, in any neighborhood of $y$, there exists $z$ such that $(\hat\theta(z),\hat\xi(z))$ is not in $F_{AD}$ so that (38) does not hold.

We now show that the boundary of the pink region is contained in the set (41). In fact, we observe that the normal cone $N(F_{ABCD})$ is the green ray attached to $F_{ABCD}$, the restriction $R(N(F_{ABCD}))$ is the singleton set $\{0\}$, and the pink region is identified as $\mathrm{Proj}_\theta(F_{ABCD}) + R(N(F_{ABCD}))$. This implies that the boundary of the pink region is $\mathrm{bd}(\mathrm{Proj}_\theta(F_{ABCD}) + R(N(F_{ABCD})))$, which is contained in the set (41).

Linearly perturbed partial projection with $d \ne 0$: We use Figure 5(d) to illustrate this case. Suppose that $-d$ is the black arrow attached to $F_{BC}$ and $y$ lies on the boundary of the pink region; for instance, the blue point in the red line segment. Similar to the first case, the solution $(\hat\theta(y),\hat\xi(y))$ of (33) is on $F_{BC}$, and $J_y$ contains the inequalities which define $F_{ABCD}$ and $F_{BCEF}$. By (39), $(\tilde\theta(z),\tilde\xi(z))$ is the linearly perturbed partial projection of $z$ onto the affine space $H = \mathrm{aff}(F_{BC})$. However, for any neighborhood of $y$, we can find a $z$ such that $(\hat\theta(z),\hat\xi(z)) \notin F_{BC}$ so that (38) does not hold.

Again, we can show that the red line segment is contained in $\mathrm{bd}(\mathrm{Proj}_\theta(F_{BC}) + R_{-d}(N(F_{BC})))$, and thus, contained in the set (41). In fact, the normal cone $N(F_{BC})$ is the green cone in Figure 5(d) whose restriction $R_{-d}(N(F_{BC}))$ is the green arrow. Therefore, $\mathrm{Proj}_\theta(F_{BC}) + R_{-d}(N(F_{BC}))$ is the pink region in Figure 5(d) and its boundary, $\mathrm{bd}(\mathrm{Proj}_\theta(F_{BC}) + R_{-d}(N(F_{BC})))$, is contained in the set (41). Also note that, compared to Figure 5(b), the pink region is shifted in Figure 5(d) due to the linear term $d^\top\xi$.

4.2. Applications of Theorem 4.6

4.2.1. Linear Regression

As a warm-up exercise, we show that for the ordinary LSE defined in (29) an application of Theorem 4.6 establishes the well-known result $\mathrm{df}(X\hat\beta(y)) = \mathrm{rank}(X)$.

Proposition 4.8. Let $\hat\beta(y)$ be the ordinary LSE defined in (29). The divergence of $\hat\theta(y) = X\hat\beta(y)$ equals $\mathrm{rank}(X)$ a.s. Thus, $\mathrm{df}(X\hat\beta(y)) = \mathrm{rank}(X)$.
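A quick numerical check of this warm-up is sketched below. It assumes the equality $\theta = X\xi$ is written as the two inequality blocks $X\xi - \theta \le 0$ and $-X\xi + \theta \le 0$, so that $A = [X; -X]$, $B = [-I; I]$ and every constraint is active; the helper `independent_rows` and the random rank-deficient design are illustrative choices, not part of the paper.

```python
import numpy as np


def independent_rows(M, tol=1e-10):
    """Greedily select a maximal set of linearly independent rows of M."""
    chosen = []
    for i in range(M.shape[0]):
        if np.linalg.matrix_rank(M[chosen + [i]], tol=tol) == len(chosen) + 1:
            chosen.append(i)
    return chosen


rng = np.random.default_rng(0)
n, d = 8, 5
X = rng.standard_normal((n, d))
X[:, 4] = X[:, 0] + X[:, 1]          # force rank(X) = 4

A = np.vstack([X, -X])               # coefficients of xi
B = np.vstack([-np.eye(n), np.eye(n)])  # coefficients of theta
I_y = independent_rows(np.hstack([A, B]))  # all constraints are active here

# Divergence formula of Theorem 4.6: n - |I_y| + rank(A_{I_y}).
divergence = n - len(I_y) + np.linalg.matrix_rank(A[I_y])

# Classical value: trace of the hat matrix X X^+.
hat_trace = np.trace(X @ np.linalg.pinv(X))
print(divergence, np.linalg.matrix_rank(X), round(hat_trace, 8))
```

Both quantities agree with $\mathrm{rank}(X)$, which is what Proposition 4.8 asserts.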

4.2.2. Multivariate Convex Regression

Using the characterization in Lemma 4.3, the multivariate convex regression problem can be formulated as the following optimization problem:

$$(\hat\theta(y),\hat\xi(y)) = \arg\min_{\theta \in \mathbb{R}^n,\ \xi \in \mathbb{R}^{nd}}\ \|\theta - y\|_2^2 \quad \text{s.t. } \langle \xi_j, x_k - x_j\rangle \le \theta_k - \theta_j,\ \ \forall\, j \ne k \in \{1,\ldots,n\}, \tag{42}$$

which is a special case of (28). Therefore, Theorem 4.6 can be directly applied to compute the DF of $\hat\theta(y)$.

To see this, we first represent the inequality constraints of (42) in the form of the constraints in (28), i.e., $A\xi + B\theta \le c$. In particular, $A$ is an $[n(n-1)] \times nd$ matrix and each row of $A$ is indexed by a pair $r = (j,k)$ with $j \ne k \in \{1,\ldots,n\}$ and each column is indexed by a pair $c = (j',s)$ with $j' \in \{1,\ldots,n\}$ and $s \in \{1,\ldots,d\}$. Moreover, we partition $A$ into $[n(n-1)] \times n$ blocks with each block of size $1 \times d$. Let $A_{r,j'}$ be the block of $A$ with row $r = (j,k)$ and column $j' \in \{1,\ldots,n\}$. It is defined as

$$A_{r,j'} = \begin{cases} (x_k - x_j)^\top & \text{if } j' = j,\\ 0 & \text{otherwise.} \end{cases}$$
The corresponding $B$ is an $[n(n-1)] \times n$ matrix and each row of $B$ is indexed by a pair $r = (j,k)$ with $j \ne k \in \{1,\ldots,n\}$ and each column is indexed by $c \in \{1,\ldots,n\}$. Let $B_{r,c}$ be the entry in row $r = (j,k)$ and column $c$ of the matrix $B$. It is defined as

$$B_{r,c} = \begin{cases} 1 & \text{if } c = j,\\ -1 & \text{if } c = k,\\ 0 & \text{if } c \ne j,\ c \ne k. \end{cases}$$

The corresponding $c$ will be an all-zero vector in $\mathbb{R}^{n(n-1)}$.
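The matrices can be assembled directly from the design points. The sketch below is illustrative: it assumes that the block of $A$ in row $(j,k)$ and block-column $j$ is $(x_k - x_j)^\top$ with zeros elsewhere, as read off from (31), uses 0-based indices, and the function name is hypothetical.

```python
import numpy as np


def convex_regression_constraints(X):
    """Build A, B with A @ xi + B @ theta <= 0 encoding the inequalities (31).

    X is the n x d array of design points; xi stacks xi_1, ..., xi_n.
    """
    n, d = X.shape
    pairs = [(j, k) for j in range(n) for k in range(n) if j != k]
    A = np.zeros((len(pairs), n * d))
    B = np.zeros((len(pairs), n))
    for r, (j, k) in enumerate(pairs):
        A[r, j * d:(j + 1) * d] = X[k] - X[j]  # <xi_j, x_k - x_j>
        B[r, j] = 1.0                          # + theta_j
        B[r, k] = -1.0                         # - theta_k
    return A, B, pairs


# Sanity check with a known convex function f(x) = ||x||^2, gradient 2x.
rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 2))
theta = (X ** 2).sum(axis=1)
xi = (2.0 * X).ravel()
A, B, pairs = convex_regression_constraints(X)
print(A.shape, B.shape, float((A @ xi + B @ theta).max()))
```

For this choice each constraint value equals $-\|x_k - x_j\|_2^2$, so all $n(n-1)$ inequalities hold.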

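To apply Theorem 4.6, we note that the index set (37) of the active constraints becomes

$$J_y := \{(j,k) : \langle \hat\xi_j, x_k - x_j\rangle = \hat\theta_k - \hat\theta_j\}. \tag{43}$$

Let $I_y \subseteq J_y$ be the index set of maximal independent rows of the matrix $[A_{J_y}, B_{J_y}]$. We have by Theorem 4.6,

$$D(y) = n - |I_y| + \mathrm{rank}(A_{I_y}) \quad \text{and} \quad \mathrm{df}(\hat\theta(y)) = n - \mathbb{E}[|I_y|] + \mathbb{E}[\mathrm{rank}(A_{I_y})].$$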
4.2.3. Lasso and Generalized Lasso

For the generalized Lasso problem described in (34), we characterize the DF $\mathrm{df}(X\hat\beta(y))$ in the following corollary.

Corollary 4.9. In the generalized Lasso problem in (34) and (36), for a.e. $y \in \mathbb{R}^n$,

$$\mathrm{df}(\hat\theta(y)) = \mathrm{df}(X\hat\beta(y)) = \mathbb{E}[\dim(X\ker(D_0))],$$

where $D_0$ is the sub-matrix of $D$ consisting of the rows $d_i$'s of $D$ such that $d_i^\top\hat\beta(y) = 0$ and $\ker(D_0) = \{x \in \mathbb{R}^d : D_0 x = 0\}$ is the kernel of $D_0$.

The above corollary recovers the result in Theorem 3 of Tibshirani and Taylor (2012) but is derived as a consequence of the general result in Theorem 4.6. The standard Lasso is a special case of generalized Lasso (see (34)) with $D = I_d$. In the next corollary we provide the DF of $X\hat\beta(y)$ for the Lasso estimator $\hat\beta(y)$; it recovers the result in Theorem 2 of Tibshirani and Taylor (2012).

Corollary 4.10. In the Lasso problem (34) with $D = I_d$, for a.e. $y \in \mathbb{R}^n$,

$$\mathrm{df}(\hat\theta(y)) = \mathrm{df}(X\hat\beta(y)) = \mathbb{E}[\mathrm{rank}(X_{J_0^c})],$$

where $J_0 = \{i \in \{1,\ldots,d\} : \hat\beta_i(y) = 0\}$, $J_0^c$ is the complement set of $J_0$ and $X_{J_0^c}$ consists of columns of $X$ indexed by $J_0^c$.
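The quantity inside the expectation in Corollary 4.10 is easy to evaluate for a given fit. The following is a minimal sketch, assuming a plain cyclic coordinate descent solver for the Lasso objective $\tfrac12\|y - X\beta\|_2^2 + \tau\|\beta\|_1$; the solver, the tolerance used to decide which coefficients are zero, and the function names are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np


def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, tau, n_iter=500):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + tau*||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    r = np.asarray(y, dtype=float).copy()  # running residual y - X beta
    for _ in range(n_iter):
        for j in range(d):
            r += X[:, j] * beta[j]         # remove coordinate j from the fit
            beta[j] = soft(X[:, j] @ r, tau) / col_sq[j]
            r -= X[:, j] * beta[j]         # add the updated coordinate back
    return beta


def lasso_divergence(X, beta, tol=1e-10):
    """rank of the columns of X with nonzero coefficients (one draw of y)."""
    active = np.flatnonzero(np.abs(beta) > tol)
    if active.size == 0:
        return 0
    return int(np.linalg.matrix_rank(X[:, active]))


X = np.eye(4)
y = np.array([3.0, -0.5, 1.2, 0.1])
beta = lasso_cd(X, y, tau=1.0)
print(beta, lasso_divergence(X, beta))
```

With an identity design the Lasso solution is soft-thresholding of $y$, so two coefficients survive and the divergence for this draw is 2; averaging `lasso_divergence` over simulated draws of $y$ approximates the DF.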
5. DF with Quadratically Perturbed Projections

As was the case with the multivariate isotonic LSE without any boundedness constraints, the multivariate convex LSE described in (31) tends to overfit the data, especially near the boundary of the convex hull of the design points, where the subgradients take large values. Thus, we might want to regularize the convex LSE. A natural way to achieve this is to impose bounds on the norm of the subgradients; see e.g., Sen and Meyer (2013), Lim (2014). In the penalized form this would lead to the following optimization problem:

$$(\hat\theta_\lambda(y),\hat\xi_\lambda(y)) = \arg\min_{\theta,\xi}\ \tfrac12\|\theta - y\|_2^2 + \tfrac{\lambda}{2}\sum_{j=1}^{n}\|\xi_j\|_2^2 \quad \text{s.t. } \langle \xi_j, x_k - x_j\rangle \le \theta_k - \theta_j,\ \ \forall\, j \ne k. \tag{44}$$

The above optimization problem is actually a special case of the following more general problem:

$$(\hat\theta_\lambda(y),\hat\xi_\lambda(y)) = \arg\min_{\theta,\xi}\ \tfrac12\|\theta - y\|_2^2 + \tfrac{\lambda}{2}\|\xi\|_2^2 \quad \text{s.t. } A\xi + B\theta \le c, \tag{45}$$

where $A$, $B$ and $c$ are suitable matrices of appropriate dimensions. We call problem (45) a quadratically perturbed projection due to the quadratic term $\tfrac{\lambda}{2}\|\xi\|_2^2$. For the penalized multivariate convex regression problem in (44), the corresponding $A$, $B$, and $c$ are given in Section 4.2.2. The divergence of $\hat\theta_\lambda(y)$, as the solution of the general optimization problem (45), is characterized by the following result.

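Theorem 5.1. For each given $\lambda > 0$ and $y \in \mathbb{R}^n$, the optimal solution $(\hat\theta_\lambda(y),\hat\xi_\lambda(y))$ to the optimization problem in (45) is unique. The components of $\hat\theta_\lambda(y)$ are almost differentiable, and $\nabla(\hat\theta_\lambda)_i$ is an essentially bounded function for each $i = 1,\ldots,n$. Let

$$J_y := \{1 \le i \le m : \langle a_i,\hat\xi_\lambda(y)\rangle + \langle b_i,\hat\theta_\lambda(y)\rangle = c_i\}, \tag{46}$$

and $A_{J_y}$ and $B_{J_y}$ be the submatrices of $A$ and $B$ with rows in the set $J_y$. Further let $I_y \subseteq J_y$ be the index set of maximal independent rows of the matrix $[A_{J_y}, B_{J_y}]$, i.e., the vectors $\{[a_i^\top, b_i^\top] : i \in I_y\}$ are independent. Then, we have,

$$D(y) = n - \mathrm{trace}\Big( B_{I_y}^\top \big[ B_{I_y}B_{I_y}^\top + \tfrac{1}{\lambda} A_{I_y}A_{I_y}^\top \big]^{-1} B_{I_y} \Big), \tag{47}$$

and $\mathrm{df}(\hat\theta_\lambda(y)) = \mathbb{E}[D(y)]$ (note that the index set $I_y$ is random).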
We first note that the matrix $B_{I_y}B_{I_y}^\top + \tfrac{1}{\lambda}A_{I_y}A_{I_y}^\top$ is invertible. To see this observe that, from the definition of $I_y$, the rows of $[A_{I_y}, B_{I_y}]$ are linearly independent. Therefore, the matrix

$$\big[\tfrac{1}{\sqrt{\lambda}}A_{I_y},\ B_{I_y}\big]\big[\tfrac{1}{\sqrt{\lambda}}A_{I_y},\ B_{I_y}\big]^\top = B_{I_y}B_{I_y}^\top + \tfrac{1}{\lambda}A_{I_y}A_{I_y}^\top$$

is invertible. The uniqueness of $(\hat\theta_\lambda(y),\hat\xi_\lambda(y))$ is due to the strong convexity of the objective function in (45). To characterize the divergence $D(y)$ we introduce a lemma similar to Lemma 4.7, which characterizes the local property of the optimal solution of (45).

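Lemma 5.2. For each fixed $\lambda > 0$ and a.e. $y \in \mathbb{R}^n$,

$$\hat\theta_\lambda(z) = \tilde\theta_\lambda(z), \quad \text{for any } z \text{ in a neighborhood } U \text{ of } y, \tag{48}$$

where $\hat\theta_\lambda(z)$ is computed via (45) and $\tilde\theta_\lambda(z)$ is computed via the following optimization problem:

$$(\tilde\theta_\lambda(z),\tilde\xi_\lambda(z)) = \arg\min_{\theta,\xi}\ \tfrac12\|\theta - z\|_2^2 + \tfrac{\lambda}{2}\|\xi\|_2^2 \quad \text{s.t. } A_{J_y}\xi + B_{J_y}\theta = c_{J_y}, \tag{49}$$

where the index set $J_y$ is defined in (46).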


We first note that from the definition of $I_y$ we can replace $J_y$ in (49) by $I_y$ without changing the definition of $(\tilde\theta_\lambda(z),\tilde\xi_\lambda(z))$. Let $H = \mathrm{aff}(F)$ be defined as in (40), where the face $F$ is represented in (25) with $J = J_y$. Similar to Lemma 4.7, Lemma 5.2 states that a quadratically perturbed projection onto $Q$ is locally equivalent to a quadratically perturbed projection onto an affine space $H$ for a.e. $y$. With the constraint set changing to $H$ for a.e. $y$, the divergence of $\hat\theta_\lambda(y)$ can be characterized using the KKT conditions of (49) and the implicit function theorem, a classical result in analysis (see Theorem 5.3 below).

Note that it suffices to prove Lemma 5.2 for $\lambda = 1$. The case when $\lambda \ne 1$ can be reduced to the case with $\lambda = 1$ by letting $\gamma = \sqrt{\lambda}\,\xi$ and reformulating the problem (45) as

$$(\hat\theta_\lambda(y),\hat\gamma_\lambda(y)) = \arg\min_{\theta,\gamma}\ \tfrac12\|(\gamma,\theta) - (0,y)\|_2^2 \quad \text{s.t. } \tfrac{1}{\sqrt{\lambda}}A\gamma + B\theta \le c, \tag{50}$$

which does not change the definition of $\hat\theta_\lambda(y)$.

When $\lambda = 1$, it is clear from (50) that the optimization in (45) is in fact the regular projection of $(0,y)$ onto $Q := \{(\xi,\theta) \in \mathbb{R}^{p+n} : A\xi + B\theta \le c\}$. By Lemma 2.4, (48) holds if $(0,y) \notin \mathrm{bd}(F + N(F))$ for any face $F$ of $Q$. Therefore, the local equivalence in (48) holds for any $y \notin R(\mathrm{bd}(F + N(F)))$. However, $R(\mathrm{bd}(F + N(F)))$ may have a positive measure in the domain of $y$ (i.e., $\mathbb{R}^n$), although $\mathrm{bd}(F + N(F))$ is a measure zero set in $\mathbb{R}^{p+n}$. Therefore, we cannot prove Lemma 5.2 as a corollary of Lemma 2.4.

To address this challenge, we identify a new set of measure zero and show that any $y$ that does not satisfy (48) is contained in this measure zero set. In particular, such a set takes the form

$$\bigcup_{F \text{ is a face of } Q} \mathrm{bd}\big(R(F + N(F))\big), \tag{51}$$

which is a set of measure zero in $\mathbb{R}^n$ since $R(F + N(F)) \subseteq \mathbb{R}^n$.

In Figure 6 we provide a graphical illustration to show that any $y$ that does not satisfy (48) is contained in the set (51). In Figure 6, the eight vertices of the polyhedron $Q$ are indexed by $A, B, \ldots, H$.

Suppose that $(0,y)$ is in the interior of the pink region, e.g., the blue point in Figure 6. We claim that such a $y$ satisfies (48). In fact, $(\hat\theta_\lambda(y),\hat\xi_\lambda(y))$ is the red point, which lies on $F_{EF}$. Then, $J_y$ contains the inequalities that define $F_{BCEF}$ and $F_{EFGH}$. By the definition (49), $(\tilde\theta(z),\tilde\xi(z))$ is the quadratically perturbed projection of $z$ onto the affine space $H = \mathrm{aff}(F_{EF})$, which is identical to $(\hat\theta(z),\hat\xi(z))$ for all $z$ in a neighborhood of $y$ so that (48) holds.

Let us now consider the case when $y$ is on the boundary of the pink region, e.g., on the black solid line connecting vertices $E$ and $F$ in Figure 6. We claim that such a $y$ does not satisfy (48). In this case, the solution $(\hat\theta(y),\hat\xi(y))$ still lies on $F_{EF}$ and $J_y$ still contains the inequalities that define $F_{BCEF}$ and $F_{EFGH}$ so that $(\tilde\theta(z),\tilde\xi(z))$ is still the quadratically perturbed projection of $z$ onto the affine space $H = \mathrm{aff}(F_{EF})$. However, for any neighborhood of $y$, we can find a $z$ such that $(\hat\theta(z),\hat\xi(z)) \notin F_{EF}$ so that (48) does not hold. Now we show that such a $y$ is contained in the set (51). In fact, $N(F_{EF})$ is the green cone in Figure 6 and $R(F_{EF} + N(F_{EF}))$ is the pink region. Hence, the boundary of the pink region is $\mathrm{bd}(R(F_{EF} + N(F_{EF})))$, which is contained in the set (51).

According to Lemma 5.2, $\hat\theta_\lambda(y)$ and $\tilde\theta_\lambda(y)$ have the same local property so that we can characterize the divergence of $\hat\theta_\lambda(y)$ as the divergence of $\tilde\theta_\lambda(y)$. Since $(\tilde\theta_\lambda(y),\tilde\xi_\lambda(y))$ is the optimal solution of (49), which is an optimization problem with only equality constraints, it must satisfy the KKT conditions of (49), which form a system of equalities parameterized by $y$. Hence, the derivative of $\tilde\theta_\lambda(y)$ can be characterized by applying the classical implicit function theorem (stated below as Theorem 5.3) to this system of equalities. Note that we provide a new connection between DF and the implicit function theorem, which is a general tool with potential applications to other (shape-restricted) regression problems.
Fig 6: Illustration of Lemma 5.2.

Theorem 5.3 (Implicit function theorem). Let $F : U \to \mathbb{R}^{n_2}$ be defined in a neighborhood $U \subseteq \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ of $(u_0,v_0) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$. Suppose that $F$ is continuously differentiable, satisfies $F(u_0,v_0) = 0$, and $\nabla_v F(u_0,v_0)$ is an $n_2 \times n_2$ invertible matrix. Then there exists a neighborhood $U_{u_0} \subseteq \mathbb{R}^{n_1}$ of $u_0$ and a continuously differentiable function $f(u) : U_{u_0} \to \mathbb{R}^{n_2}$ such that

$$F(u,v) = 0 \iff v = f(u),$$

for any $u \in U_{u_0}$ and

$$\nabla f(u) = -\big[\nabla_v F(u,f(u))\big]^{-1}\big[\nabla_u F(u,f(u))\big]. \tag{52}$$

To characterize the divergence of $\tilde\theta_\lambda(y)$ we view $(\theta,\xi)$ and $z$ in (49) as $u$ and $v$ in Theorem 5.3, respectively, and let $F(\theta,\xi,v) = F(u,v) = 0$ be the KKT conditions of (49). Hence, $\tilde\theta_\lambda(y)$ can be viewed as the implicit function induced by this KKT system whose derivative can be characterized by (52). Note that we cannot directly apply the implicit function theorem to the KKT conditions of (45) because the corresponding KKT conditions involve inequalities and cannot be represented as a system of equalities of the form $F(u,v) = 0$. This shows the necessity of Lemma 5.2 which establishes the local equivalence between (49) and (45).
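Because the equality-constrained problem (49) is a linear-quadratic program, its KKT system can also be solved in closed form, which gives a quick way to see where an "$n$ minus trace" expression for the divergence comes from. The derivation below is a sketch under the assumption that the constraints are restricted to an independent row set $I_y$; the multiplier $u$ and the shorthand $A := A_{I_y}$, $B := B_{I_y}$, $c := c_{I_y}$ are notation introduced here for illustration.

```latex
\begin{align*}
&\text{KKT conditions of (49):}\quad
  \theta - z + B^\top u = 0, \qquad \lambda\,\xi + A^\top u = 0, \qquad A\xi + B\theta = c, \\
&\text{substituting } \xi = -\tfrac{1}{\lambda}A^\top u,\ \theta = z - B^\top u:\quad
  \big(BB^\top + \tfrac{1}{\lambda}AA^\top\big)\,u = Bz - c, \\
&\Longrightarrow\quad
  \tilde\theta_\lambda(z) = z - B^\top\big(BB^\top + \tfrac{1}{\lambda}AA^\top\big)^{-1}(Bz - c), \\
&\Longrightarrow\quad
  \sum_{i=1}^{n}\frac{\partial(\tilde\theta_\lambda)_i}{\partial z_i}
  = n - \operatorname{trace}\!\Big(B^\top\big(BB^\top + \tfrac{1}{\lambda}AA^\top\big)^{-1}B\Big).
\end{align*}
```

The matrix being inverted is positive definite whenever the rows of $[A, B]$ are linearly independent and $\lambda > 0$, so each step is well defined.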

We also note that the classical ridge regression, described as

$$\hat\beta_\lambda(y) \in \arg\min_{\beta}\ \tfrac12\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2, \tag{53}$$

is a special case of the general optimization problem (45) by letting $\theta = X\beta$. Theorem 5.1 can be applied to (53) to obtain $\mathrm{df}(X\hat\beta_\lambda(y))$. This recovers the results in Li (1986).

Corollary 5.4. In the ridge regression problem (53), for a.e. $y \in \mathbb{R}^n$, $\mathrm{df}(X\hat\beta_\lambda(y)) = \mathrm{trace}\big(X(X^\top X + \lambda I_d)^{-1}X^\top\big)$.
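The ridge formula can be cross-checked numerically in three ways. The sketch below is illustrative and rests on assumptions stated here: the design is random, the equality $\theta = X\beta$ is represented by the independent rows $[X, -I]$ (so $A = X$ and $B = -I$), and the "$n$ minus trace" expression is the one obtained from the KKT system of the equality-constrained problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 10, 4, 0.7
X = rng.standard_normal((n, d))

# (a) trace of X (X^T X + lam I)^{-1} X^T, as in Corollary 5.4.
df_ridge = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T))

# (b) n - trace(B^T (B B^T + A A^T / lam)^{-1} B) with A = X, B = -I
#     (assumed representation of the constraint theta = X beta).
A, B = X, -np.eye(n)
df_kkt = n - np.trace(B.T @ np.linalg.solve(B @ B.T + A @ A.T / lam, B))

# (c) singular-value form: sum_i s_i^2 / (s_i^2 + lam).
s = np.linalg.svd(X, compute_uv=False)
df_svd = float(np.sum(s ** 2 / (s ** 2 + lam)))

print(df_ridge, df_kkt, df_svd)
```

All three numbers coincide up to rounding, and they decrease from $\mathrm{rank}(X)$ towards 0 as $\lambda$ grows.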


6. SURE and the Choice of Tuning Parameters 


Consider the general formulation of the problem posited in (1). Suppose that our estimator $\hat\theta_\lambda(y)$ depends on a tuning parameter $\lambda$, for $\lambda \ge 0$. For example, in bounded isotonic regression, the projection estimator depends on the choice of the range of $\theta$ (see (20)); in penalized convex regression (see (44)) the estimator depends on the tuning parameter $\lambda$.

In this section we use the SURE to choose the tuning parameter $\lambda$. Let

$$L_n(\lambda) = \|\hat\theta_\lambda(y) - \theta^*\|_2^2 \tag{54}$$

denote the loss in estimating $\theta^*$ by $\hat\theta_\lambda(y)$. We would ideally like to choose $\lambda$ by minimizing $L_n(\cdot)$. Let

$$\lambda^* := \arg\min_{\lambda \ge 0} L_n(\lambda). \tag{55}$$

We note that $\lambda^*$ is a random quantity as $L_n(\lambda)$ is random. Of course, we cannot compute $\lambda^*$ as we do not know $\theta^*$. However we can minimize an estimator of $L_n$, assuming that $\sigma$ is known, as described below. Let

$$U_n(\lambda) := \|y - \hat\theta_\lambda(y)\|_2^2 + 2\sigma^2 D(\hat\theta_\lambda(y)) - n\sigma^2, \tag{56}$$

where $D(\hat\theta_\lambda(y))$ denotes the divergence of $\hat\theta_\lambda(y)$. It is well known that

$$\mathbb{E}[U_n(\lambda)] = \mathbb{E}[L_n(\lambda)], \quad \text{for all } \lambda \ge 0;$$

see Stein (1981) (also see Proposition 2 of Meyer and Woodroofe (2000)). $U_n(\lambda)$ is usually called the SURE. Let

$$\hat\lambda := \arg\min_{\lambda \ge 0} U_n(\lambda) \tag{57}$$

be the minimizer of $U_n(\lambda)$, which can be computed from the data (if $\sigma^2$ is assumed known). Note that here we would need to compute the divergence of $\hat\theta_\lambda(y)$, which we can calculate using the results in the previous sections.
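The selection rule in (56) and (57) can be sketched as a grid search. The code below is a minimal illustration, not the simulation code of the paper: `sure` evaluates (56) for a given fit and divergence, `select_by_sure` minimizes it over a user-supplied grid, and `shrink` is a hypothetical stand-in estimator (ridge with an identity design, for which the fit is $y/(1+\lambda)$ and the divergence is $n/(1+\lambda)$).

```python
import numpy as np


def sure(y, fit, divergence, sigma2):
    """Stein's unbiased risk estimate: ||y - fit||^2 + 2*sigma2*div - n*sigma2."""
    y = np.asarray(y, dtype=float)
    fit = np.asarray(fit, dtype=float)
    return float(np.sum((y - fit) ** 2) + 2.0 * sigma2 * divergence - y.size * sigma2)


def select_by_sure(y, estimator, grid, sigma2):
    """Pick the grid value minimizing SURE.

    estimator(y, lam) must return (fitted values, divergence at y).
    """
    scores = []
    for lam in grid:
        fit, div = estimator(y, lam)
        scores.append(sure(y, fit, div, sigma2))
    best = int(np.argmin(scores))
    return grid[best], scores


def shrink(y, lam):
    """Illustrative estimator: ridge with identity design."""
    y = np.asarray(y, dtype=float)
    return y / (1.0 + lam), y.size / (1.0 + lam)


y = np.array([3.0, -1.0, 2.0, 0.5])
grid = [0.0, 0.5, 1.0, 2.0]
lam_hat, scores = select_by_sure(y, shrink, grid, sigma2=1.0)
print(lam_hat, scores)
```

Any estimator from the previous sections can be plugged in by returning its fit together with the divergence given by the corresponding theorem.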


We study the ratio

$$\frac{L_n(\hat\lambda)}{L_n(\lambda^*)} \tag{58}$$

to gain insights into the performance of the SURE. Of course, the above ratio is always at least 1, and we expect it to be close to 1 if SURE performs well. In the following, we empirically study the behavior of $L_n(\hat\lambda)/L_n(\lambda^*)$ for bounded isotonic regression and penalized convex regression.
Fig 7: Comparison between the calculated DF using (22) and the true DF using the definition in (2).


6.1. Bounded Isotonic Regression 

We generate $n$ i.i.d. design points $x_i \sim \mathrm{Unif}[0,1]^d$, for $i = 1, \ldots, n$. We set the
regression function $f : \mathbb{R}^d \to \mathbb{R}$ to be $f(x) = \|x\|_2$. For each pair $(i, j)$, we put
an isotonic constraint $\theta_i \le \theta_j$ whenever $x_i \le x_j$ pointwise. We further add a
boundedness constraint $\max_i \theta_i - \min_i \theta_i \le \lambda$, where $\lambda$ is the tuning
parameter. We generate the response $y_i$, for $i = 1, \ldots, n$, according to model (4)
with $\sigma = 1$.
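
The data-generating step and the pairwise isotonic constraints can be sketched as follows. This is an illustrative sketch only: it assumes that model (4) is $y_i = f(x_i) + \epsilon_i$ with Gaussian noise and reads the regression function as the Euclidean norm; the helper name `isotonic_pairs` is hypothetical.

```python
import numpy as np

def isotonic_pairs(x):
    # All ordered pairs (i, j), i != j, with x[i] <= x[j] in every coordinate;
    # each pair corresponds to one constraint theta_i <= theta_j.
    leq = np.all(x[:, None, :] <= x[None, :, :], axis=2)
    np.fill_diagonal(leq, False)
    return np.argwhere(leq)

rng = np.random.default_rng(0)
n, d, sigma = 100, 2, 1.0
x = rng.uniform(0.0, 1.0, size=(n, d))       # design points in [0, 1]^d
f = np.linalg.norm(x, axis=1)                # assumed regression function
y = f + sigma * rng.standard_normal(n)       # assumed form of model (4)
pairs = isotonic_pairs(x)
print(pairs.shape)
```

The number of rows of `pairs` grows quadratically in $n$ for small $d$, which is one reason the constraint matrix is better stored sparsely in larger experiments.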

We first use simulations to demonstrate that the characterization in (22) indeed
gives the correct DF in this example, comparing it with the formal definition of
DF, given in (2). We set $n = 100$ and $d = 2$ and vary the parameter $\lambda$ over an
interval to achieve different levels of DF. When computing the DF, we use the
empirical mean from 500 independent replications to approximate the expectation
over the distribution of $y$. We calculate the DF using (22) and using its definition
in (2) and plot the comparison in Figure 7. As we can see from Figure 7, the DF
curve calculated using (22) (red line) is almost identical to the true DF curve
obtained from (2) (blue line). This empirically demonstrates the correctness of
(22).

Next, we demonstrate the performance of the parameter $\hat\lambda$ selected using SURE.
In particular, we compute the ratio $L_n(\hat\lambda)/L_n(\lambda^*)$ in (58) (we call this the SURE
ratio), where $L_n$ is the squared loss defined in (54), $\hat\lambda$ is the parameter selected
via SURE in (57), and $\lambda^*$ is the oracle tuning parameter in (55). We also compare
the performance of bounded isotonic regression to the unbounded one, which
does not include the boundedness constraint $\max_i \theta_i - \min_i \theta_i \le \lambda$ (or equivalently,
sets $\lambda = +\infty$). In particular, we calculate the ratio between the loss from unbounded
isotonic regression and the oracle loss, i.e., $L_n(\infty)/L_n(\lambda^*)$ (we call this
the unbounded ratio), and compare it to the SURE ratio $L_n(\hat\lambda)/L_n(\lambda^*)$.

Fig 8: Comparison between the unbounded ratio and the SURE ratio for isotonic
regression. [Recoverable panels: (c) d = 7, (d) d = 10; horizontal axis: n; curves: Unbounded, SURE.]

We set $d = 2, 5, 7, 10$ and for each fixed $d$, we vary the sample size $n = 100$,
200, 500, 1000, 2000 and compute the SURE and unbounded ratios over 100
independent replications and plot the results in Figure 8. While calculating the
SURE ratio we used the known value of $\sigma$, which may not be available in a
real application. As error variance estimation is a very well-studied problem in
nonparametric regression and there are several methods already available in the
statistical literature (see e.g., Dette, Munk and Wagner (1998), Kulasekera and
Gallagher (2002), Müller, Schick and Wefelmeyer (2003), Munk et al. (2005) and
the references therein) we do not discuss this issue further. In practice any of
the above methods could be used to estimate $\sigma^2$.

From Figure 8, one can see that the SURE ratios are, in general, much smaller
than the unbounded ratios, illustrating the usefulness of including the boundedness
constraint to penalize the model complexity in isotonic regression. Further,
we also observe that as $n$ increases, the standard deviations of both the unbounded
ratio and the SURE ratio decrease, in most cases. Also, as expected,
the need for regularization (penalization) is more apparent as we increase the
dimension $d$ of the problem.

Fig 9: Boxplots of the unbounded ratio and the SURE ratio for convex regression.
[Panels: (a) n = 100, d = 2; (b) n = 100, d = 4; (c) n = 500, d = 4; (d) n = 500, d = 10; boxes in each panel: SURE-tuned lambda, lambda = 0.]

6.2. Penalized Multivariate Convex Regression 

We generate $n$ i.i.d. design points $x_i \sim \mathrm{Unif}[-1, 1]^d$, for $i = 1, \ldots, n$. We set the
convex regression function $f : \mathbb{R}^d \to \mathbb{R}$ to be $f(x) = \|x\|_2$, which is symmetric
around 0 as $x \in [-1, 1]^d$. We generate the response $y_i$, for $i = 1, \ldots, n$, according
to model (4) with $\sigma = 0.5$. We use the CVX package (Grant and Boyd, 2014) to
compute the penalized multivariate convex regression estimator, defined in (44).

We note that since the optimization problem for penalized multivariate convex
regression in (44) has a lot of constraints and variables (i.e., $n(n-1)$ constraints
and $nd$ variables), we only consider smaller sample sizes ($n$) in our simulation
experiments. Nevertheless, a smaller $n$ is still sufficient to demonstrate the superior
performance of the parameter chosen by minimizing SURE. In particular, we consider
$d = 2, 4, 10$, $n = 100$ and 500, and compute the SURE ratio $L_n(\hat\lambda)/L_n(\lambda^*)$
and the "un-penalized ratio" $L_n(0)/L_n(\lambda^*)$ (i.e., the ratio between the loss
obtained from the un-penalized multivariate convex regression estimator in (42)
and the oracle loss). We present the results in the form of boxplots in Figure 9,
obtained from 100 independent replicates of $y$ (fixing the design variables). We
observe that the penalized multivariate convex regression, with the regularization
parameter tuned by SURE, has a better performance. As we had inferred
from Figure 8, Figure 9 also shows that the SURE ratios are much smaller than
the unbounded/un-penalized ratios and their difference is more pronounced as
the dimension $d$ increases. Further, the ratio $L_n(\hat\lambda)/L_n(\lambda^*)$ concentrates near 1,
suggesting that SURE is doing a very good job in selecting the tuning parameter.

7. Discussion 

In this paper we develop a novel framework for computing divergence and DF 
in convex constrained regression problems. Although our main focus is to study 
DF for shape constrained LSEs, we recover known results on DF for Lasso and 
generalized Lasso as straightforward consequences of our general theory. In the
following we mention a couple of open research questions that emanate from our
research.

In Theorem 3.4 we show that for bounded isotonic regression DF is monotonic 
in the model complexity parameter. We expect (from the empirical studies we 
conducted) the same result to hold for penalized convex regression, although a 
proof of this is difficult due to the complicated structure of the constraints (see 
(44)) and beyond the scope of the present paper. 

We recommend the use of SURE to choose the tuning parameters in the two 
main examples discussed. In fact, our simulation studies illustrate the superior 
performance of SURE. However, a formal theory showing that SURE indeed 
works in shape restricted regression problems is still unknown. We believe that 
this is an interesting future research direction and a difficult one — in fact, very 
little is known on the theoretical performance of the SURE, even for the Lasso 
estimator. 

8. Appendix 

8.1. Proofs of Results from Section 2 

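Proof of Lemma 2.1. Suppose that $\theta \in \mathrm{aff}(F)$, i.e., $\theta = \sum_{j=1}^{k} \alpha_j \theta_j$ where $k > 0$,
$\theta_j \in F$, $\alpha_j \in \mathbb{R}$ and $\sum_{j=1}^{k} \alpha_j = 1$. For any $i \in J_F$, $\langle a_i, \theta \rangle = \sum_{j=1}^{k} \alpha_j \langle a_i, \theta_j \rangle = \sum_{j=1}^{k} \alpha_j b_i = b_i$.
Therefore, the inclusion $\subseteq$ follows.

Suppose $\theta$ satisfies $\langle a_i, \theta \rangle = b_i$ for all $i \in J_F$. We claim that there exists $\theta' \in F$
such that $\langle a_i, \theta' \rangle < b_i$ for all $i \in J_F^c$. In fact, by the definition of the maximal index
set $J_F$, there exists $\theta_i \in F$ for each $i \in J_F^c$ such that $\langle a_i, \theta_i \rangle < b_i$. Then, $\theta'$
can be chosen as $(\sum_{i \in J_F^c} \theta_i)/|J_F^c| \in F$. If $\theta = \theta'$, $\theta$ belongs to $F \subseteq \mathrm{aff}(F)$. If
$\theta \ne \theta'$, there exists a sufficiently small $\epsilon > 0$ such that $\theta_\epsilon := \epsilon\theta + (1 - \epsilon)\theta'$
satisfies $\langle a_i, \theta_\epsilon \rangle = b_i$ for all $i \in J_F$ and $\langle a_i, \theta_\epsilon \rangle < b_i$ for all $i \in J_F^c$. Hence, $\theta_\epsilon \in F$,
which implies that $\theta = \theta_\epsilon/\epsilon + (\epsilon - 1)\theta'/\epsilon \in \mathrm{aff}(F)$. Therefore, the inclusion $\supseteq$
follows. □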

Proof of Lemma 2.2. Since $z \in F + N(F)$, there exist $z' \in F$ and $h \in N(F)$ such
that $z = z' + h$. Since $\hat z := P_C(z)$ is the optimal solution of $\min_{\theta \in C} \|\theta - z\|_2$, by the
optimality condition (see e.g., Bertsekas, Nedic and Ozdaglar (2003, Proposition
4.7.1)), we have

$$\langle \hat z - z, \theta - \hat z \rangle = \langle \hat z - z' - h, \theta - \hat z \rangle \ge 0$$

for any $\theta \in C$. Choosing $\theta = z'$ in the inequality above, we have

$$\langle h, \hat z - z' \rangle \ge \|\hat z - z'\|_2^2.$$

As $h \in N(F)$, $z' \in F \subseteq \operatorname*{argmax}_{\theta \in C} h^\top \theta$, which implies $\langle h, \hat z - z' \rangle \le 0$, again appealing
to the optimality condition. This, together with the above display, implies
$\hat z = z' \in F$. □

Proof of Theorem 2.3. The (almost) differentiability of the components of $\hat\theta(y)$ and
the boundedness of $\nabla\hat\theta_i$ directly follow from the proof of Proposition 1 in Meyer
and Woodroofe (2000). In particular, using the same argument as in the proof of
Proposition 1 in Meyer and Woodroofe (2000), we can establish the Lipschitz
continuity of $\hat\theta(y)$, i.e., $\|\hat\theta(y_1) - \hat\theta(y_2)\|_2 \le \|y_1 - y_2\|_2$ for any $y_1, y_2 \in \mathbb{R}^n$. Then
the a.e. differentiability directly follows from Rademacher's theorem (Federer,
1969).

From Lemma 2.4, for a.e. $y$ and every $z$ in a neighborhood of $y$, we have

$$\operatorname*{argmin}_{\theta \in C} \|\theta - z\|_2 = \operatorname*{argmin}_{\theta \in H} \|\theta - z\|_2,$$

which implies that $D(y) = \nabla_y P_C(y) = \nabla_y P_H(y)$. Also, it is known that $\nabla_y P_H(y)$
is the dimensionality of the affine space. The affine subspace $H$ in Lemma 2.4
can be decomposed into a linear subspace $L$ and a point $\theta_0 \in \mathbb{R}^n$ where

$$L := \{\theta \in \mathbb{R}^n : A_{J_y}\theta = 0\} \qquad (59)$$

and $\theta_0 \in \mathbb{R}^n$ satisfies the equality $A_{J_y}\theta_0 = b_{J_y}$. Thus, $H = L + \theta_0$, where the sum is
the Minkowski sum. We also note that such a $\theta_0$ always exists since $\hat\theta(y)$ satisfies
$A_{J_y}\hat\theta(y) = b_{J_y}$. By the decomposition $H = L + \theta_0$, we have

$$P_H(z) = P_L(z - \theta_0) + \theta_0,$$

which further implies that

$$\nabla_y P_H(y) = \dim(L) = n - \mathrm{rank}(A_{J_y}).$$

This completes the proof of Theorem 2.3. □
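
The last step of the proof, that the divergence of the projection onto an affine subspace $\{\theta : A\theta = b\}$ equals $n - \mathrm{rank}(A)$, can be checked numerically. The sketch below is illustrative only: the matrix `A`, the vector `b` and the helper `project_affine` are hypothetical, and the projection is written with the Moore-Penrose pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
A = rng.standard_normal((m, n))
A = np.vstack([A, A[0] + A[1]])     # add a redundant row; the rank stays 3
b = A @ rng.standard_normal(n)      # guarantees that {theta : A theta = b} is nonempty

def project_affine(z, A, b):
    # Euclidean projection onto H = {theta : A theta = b}: z - A^+ (A z - b).
    return z - np.linalg.pinv(A) @ (A @ z - b)

# The map z -> P_H(z) is affine with Jacobian I - A^+ A, so its divergence is the trace.
jacobian = np.eye(n) - np.linalg.pinv(A) @ A
divergence = np.trace(jacobian)
rank = np.linalg.matrix_rank(A)
z = rng.standard_normal(n)
print(divergence, n - rank)
```

The redundant row is included on purpose: it changes the number of constraints but not the rank, and hence not the divergence.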


Proof of Proposition 2.7. Denote the $n - 1$ slopes from the fit $\hat\theta(y)$ by $\eta_i$, i.e.,

$$\eta_i := \frac{\hat\theta_{i+1} - \hat\theta_i}{x_{i+1} - x_i}, \quad \text{for } i = 1, \ldots, n - 1.$$

Let $1 \le r_1 < \ldots < r_s \le n - 2$ be the values of $k$ for which $\eta_{k+1} > \eta_k$. Thus, $s$
denotes the number of changes of slopes of the fit $\hat\theta(y)$. Then, by the definition
of $J_y$ in (17),

$$J_y = \{1, \ldots, n - 2\} \setminus \{r_1, \ldots, r_s\} \quad \text{and} \quad |J_y| = n - 2 - s.$$

Thus,

$$H = \{\theta : \langle a_i, \theta \rangle = 0 \text{ for } i \in J_y\}.$$

Since each $a_i$, for $1 \le i \le n - 2$, is a sparse vector with only three non-zero elements
at the positions $i$, $i + 1$ and $i + 2$, we know that the $a_i$'s are linearly independent.
Therefore, by Theorem 2.3,

$$D(y) = \dim(H) = n - \mathrm{rank}(A_{J_y}) = n - (n - 2 - s) = s + 2.$$

□


8.2. Proof of Results from Section 3 

Proof of Proposition 3.1. By Theorem 2.3, $D(y) = n - \mathrm{rank}(A_{J_y})$. Since $A_{J_y}$
is the incidence matrix of the graph $G_{J_y}$, by a fundamental result from algebraic
graph theory (see e.g., Proposition 4.3 from Biggs (1994)), we have
$\mathrm{rank}(A_{J_y}) = n - \omega(G_{J_y})$, where $\omega(G_{J_y})$ is the number of connected components
of $G_{J_y}$. Therefore, we have

$$D(y) = n - \mathrm{rank}(A_{J_y}) = n - (n - \omega(G_{J_y})) = \omega(G_{J_y}),$$

which completes the proof of Proposition 3.1. □
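
The graph-theoretic identity used above, $\mathrm{rank}(A) = n - \omega(G)$ for the incidence matrix $A$ of a graph $G$ on $n$ nodes, can be illustrated on a small example. This is a sketch with a hypothetical edge list; the component count uses a simple union-find.

```python
import numpy as np

def incidence_matrix(n, edges):
    # One row per active constraint theta_i - theta_j = 0, i.e., per edge (i, j).
    A = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        A[row, i] = 1.0
        A[row, j] = -1.0
    return A

def count_components(n, edges):
    # Union-find over the undirected version of the graph.
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(n)})

n = 7
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]   # components {0,1,2}, {3,4}, {5}, {6}
A = incidence_matrix(n, edges)
rank = np.linalg.matrix_rank(A)
omega = count_components(n, edges)
print(n - rank, omega)
```

Here the edge (0, 2) closes a cycle, so it adds a row to `A` without increasing the rank; the divergence is driven by the number of groups of tied fitted values, not by the number of active constraints.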

Proof of Proposition 3.3. For the given partially ordered set $\mathcal{X}$ with $n$ elements,
the graph induced from the isotonic constraints is denoted by $G = (V, E)$ where
$V = \{1, \ldots, n\}$ and the set of directed edges is $E = \{(i, j) : x_i \preceq x_j\}$. Recall
that the projection estimator for unbounded isotonic regression, denoted by $\hat\theta =
(\hat\theta_1, \ldots, \hat\theta_n)^\top$, is obtained by projecting $y$ onto $\{\theta \in \mathbb{R}^n : \theta_i \le \theta_j, \forall (i, j) \in E\}$,
and the projection estimator for bounded isotonic regression, denoted by $\hat\theta_\lambda =
(\hat\theta_{\lambda,1}, \ldots, \hat\theta_{\lambda,n})^\top$, is obtained by projecting $y$ onto

$$C = \{\theta \in \mathbb{R}^n : \theta_i \le \theta_j\ \forall (i, j) \in E,\ \theta_i \le \theta_j + \lambda,\ i \in \max(V), j \in \min(V)\},$$

where $\max(V)$ and $\min(V)$ are the sets of maximal and minimal elements with
respect to the partial order, respectively. It is well known that $\hat\theta$ has a group-constant
structure, i.e., there exist disjoint subsets $U_1, U_2, \ldots, U_r$ of $V = \{1, \ldots, n\}$
with $|U_s| = k_s$ such that $V = \bigcup_{s=1}^{r} U_s$ and $\hat\theta_i = \bar\theta_s$ for each $i \in U_s$. Moreover, we
assume, without loss of generality, that $r > 1$ and $\bar\theta_1 < \bar\theta_2 < \cdots < \bar\theta_r$.

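Let $(a)_+ = \max\{a, 0\}$ and $(a)_- = \min\{a, 0\}$. Writing $\bar\theta_s$ for the common value of $\hat\theta$ on the group $U_s$, we define

$$H(L, \lambda) := \sum_{s=1}^{r} k_s (L - \bar\theta_s)_+ + \sum_{s=1}^{r} k_s (L + \lambda - \bar\theta_s)_-. \qquad (60)$$

We first show that, for any $\lambda$ such that $\bar\theta_r - \bar\theta_1 > \lambda \ge 0$, there exists a unique
$L_\lambda$ such that

$$H(L_\lambda, \lambda) = 0. \qquad (61)$$

For any $\lambda < \bar\theta_r - \bar\theta_1$, it is easy to see that $H(L, \lambda)$ is a continuous, non-decreasing
and piecewise linear function of $L$. If $H(L, \lambda)$ is not strictly increasing, there
must exist $L' < L''$ such that $H(L', \lambda) = H(L'', \lambda)$. This means that $H(L, \lambda)$ is
constant on the interval $[L', L'']$, which further implies from the definition of the
function in (60) that

$$\bar\theta_r - \lambda \le L' < L'' \le \bar\theta_1.$$

This contradicts the fact that $\bar\theta_r - \bar\theta_1 > \lambda$. Hence, $H(L, \lambda)$ is strictly increasing
as a function of $L$. Since $\lim_{L \to -\infty} H(L, \lambda) = -\infty$ and $\lim_{L \to +\infty} H(L, \lambda) = +\infty$,
there exists a unique $L_\lambda$ that satisfies $H(L_\lambda, \lambda) = 0$.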

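Next we show that this $L_\lambda$ is a non-increasing function of $\lambda$. If not, there exist
$\lambda_1$ and $\lambda_2$ such that $\bar\theta_r - \bar\theta_1 > \lambda_2 > \lambda_1 \ge 0$ and $L_{\lambda_2} > L_{\lambda_1}$. By the definitions of
$L_{\lambda_1}$ and $L_{\lambda_2}$, we have

$$\begin{aligned}
0 = H(L_{\lambda_2}, \lambda_2)
&= \sum_{s=1}^{r} k_s (L_{\lambda_2} - \bar\theta_s)_+ + \sum_{s=1}^{r} k_s (L_{\lambda_2} + \lambda_2 - \bar\theta_s)_- \\
&> \sum_{s=1}^{r} k_s (L_{\lambda_1} - \bar\theta_s)_+ + \sum_{s=1}^{r} k_s (L_{\lambda_1} + \lambda_2 - \bar\theta_s)_- \\
&\ge \sum_{s=1}^{r} k_s (L_{\lambda_1} - \bar\theta_s)_+ + \sum_{s=1}^{r} k_s (L_{\lambda_1} + \lambda_1 - \bar\theta_s)_- \\
&= 0,
\end{aligned}$$

where the first inequality holds because $H(L, \lambda)$ is strictly increasing in $L$ and the
second inequality holds because $H(L, \lambda)$ is non-decreasing in $\lambda$. This contradiction
indicates that $L_\lambda$ is a non-increasing function of $\lambda$.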

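For each node $i \in V$, we denote the set of successors and the set of predecessors
of $i$ in the partial order by

$$n^+(i) := \{j \in V : (i, j) \in E\} \quad \text{and} \quad n^-(i) := \{j \in V : (j, i) \in E\}.$$

According to the KKT conditions of isotonic regression, for $e = (i, j) \in E$,
there exists a dual variable $u_{ij} \ge 0$ for the constraint $\theta_i \le \theta_j$ such that

$$\hat\theta_i - y_i + \sum_{j \in n^+(i)} u_{ij} - \sum_{j \in n^-(i)} u_{ji} = 0, \quad \forall\, i \in V, \qquad (62)$$

and

$$u_{ij}(\hat\theta_i - \hat\theta_j) = 0, \quad \forall\, (i, j) \in E. \qquad (63)$$

Moreover, for any $(i, j) \in E$ such that $i \in U_t$ and $j \in U_s$ and $\bar\theta_t < \bar\theta_s$, we have
$\hat\theta_i < \hat\theta_j$, and thus, $u_{ij} = 0$.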

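We expand the graph $G$ to $\tilde G = (V, \tilde E)$ where $\tilde E = E \cup \{(i, j) : i \in \max(V), j \in
\min(V)\}$ and define

$$\tilde n^+(i) := \{j \in V : (i, j) \in \tilde E\} \quad \text{and} \quad \tilde n^-(i) := \{j \in V : (j, i) \in \tilde E\}.$$

Similarly, according to the KKT conditions of bounded isotonic regression, for
$e = (i, j) \in \tilde E$, there exists a dual variable $u_{\lambda,ij} \ge 0$ for the constraint either
$\theta_i \le \theta_j$ or $\theta_i \le \theta_j + \lambda$ such that

$$\hat\theta_{\lambda,i} - y_i + \sum_{j \in \tilde n^+(i)} u_{\lambda,ij} - \sum_{j \in \tilde n^-(i)} u_{\lambda,ji} = 0, \quad \forall\, i \in V, \qquad (64)$$

$$u_{\lambda,ij}(\hat\theta_{\lambda,i} - \hat\theta_{\lambda,j}) = 0, \quad \forall\, (i, j) \in E, \qquad (65)$$

$$u_{\lambda,ij}(\hat\theta_{\lambda,i} - \hat\theta_{\lambda,j} - \lambda) = 0, \quad \forall\, i \in \max(V),\ j \in \min(V). \qquad (66)$$

To show that $\hat\theta_\lambda$ defined by

$$\hat\theta_{\lambda,i} = \max\{L_\lambda, \min(L_\lambda + \lambda, \bar\theta_s)\}, \quad \text{for } i \in U_s, \qquad (67)$$

is the optimal solution for bounded isotonic regression, it suffices to construct a
non-negative value for each dual variable $u_{\lambda,ij}$, for $e = (i, j) \in \tilde E$, such that
the conditions (64), (65), and (66) hold together with $\hat\theta_\lambda$ defined by (67).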
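
The candidate solution (67) is easy to compute once the group values and group sizes of the unbounded fit are available. The following is a minimal sketch, assuming these are given as arrays `theta_bar` and `k` (hypothetical names): it solves $H(L, \lambda) = 0$ from (60)-(61) by bisection and then clips the group values to $[L_\lambda, L_\lambda + \lambda]$.

```python
import numpy as np

def H(L, lam, theta_bar, k):
    # H(L, lambda) from (60): positive parts below L, negative parts above L + lambda.
    below = np.maximum(L - theta_bar, 0.0)
    above = np.minimum(L + lam - theta_bar, 0.0)
    return float(np.sum(k * below) + np.sum(k * above))

def solve_L(lam, theta_bar, k, iters=200):
    # H is negative at min(theta_bar) - lam and positive at max(theta_bar),
    # and strictly increasing in L when max - min > lam, so bisection applies.
    lo, hi = theta_bar.min() - lam, theta_bar.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if H(mid, lam, theta_bar, k) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bounded_fit(lam, theta_bar, k):
    # Group values of the bounded fit as in (67): max{L, min(L + lam, theta_bar_s)}.
    L = solve_L(lam, theta_bar, k)
    return L, np.clip(theta_bar, L, L + lam)

theta_bar = np.array([0.0, 1.0, 3.0])   # group values of the unbounded fit
k = np.array([2.0, 1.0, 1.0])           # group sizes
L, fit = bounded_fit(1.0, theta_bar, k)
print(L, fit)
```

In this toy case $H(L, 1) = 3L - 2$ on $[0, 1]$, so $L_\lambda = 2/3$ and the clipped group values are $(2/3, 1, 5/3)$; their range equals $\lambda = 1$ and the size-weighted sum of the fit is unchanged, as the balance condition (61) requires.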

We will do this by solving a transportation problem, which is a classical problem
in operations research (see, e.g., Dantzig (1959, Chapter 14)). In a transportation
problem, some demands and supplies of a product are located in different nodes
of a (directed) graph and we need to determine a transportation plan that sends
products from the supply nodes along the arcs to meet the demands in the demand
nodes.

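To construct the transportation problem, we consider a directed graph $\bar G =
(\bar V, \bar E)$ with

$$\bar V := \{i \in V : \hat\theta_i < L_\lambda \text{ or } \hat\theta_i > L_\lambda + \lambda\},$$

and

$$\bar E := \{(i, j) \in \tilde E : i \in \bar V \text{ and } j \in \bar V\} \setminus \{(i, j) \in E : \hat\theta_i < L_\lambda \text{ and } \hat\theta_j > L_\lambda + \lambda\},$$

where $L_\lambda$ is the unique value satisfying (61). Note that $\bar G$ is a subgraph of $\tilde G$
containing the arcs in $\tilde E$ whose both ends are in $\bar V \subset V$. We also define

$$\bar n^+(i) := \{j \in \bar V : (i, j) \in \bar E\} \quad \text{and} \quad \bar n^-(i) := \{j \in \bar V : (j, i) \in \bar E\}.$$

We claim there is at least one node $i \in \bar V$ with $\hat\theta_i < L_\lambda$. If not, since $\bar\theta_r - \bar\theta_1 > \lambda$,
we will have $L_\lambda \le \bar\theta_1 < \bar\theta_r - \lambda$ so that $L_\lambda + \lambda < \bar\theta_r$. As a result, we have
$H(L_\lambda, \lambda) < 0$, contradicting (61). Similarly, we can show there is at least one
node $i \in \bar V$ with $\hat\theta_i > L_\lambda + \lambda$.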

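Then, to each node $i \in \bar V$ with $\hat\theta_i < L_\lambda$, we assign a demand of $L_\lambda - \hat\theta_i > 0$.
To each node $i \in \bar V$ with $\hat\theta_i > L_\lambda + \lambda$, we assign a supply of $\hat\theta_i - L_\lambda - \lambda > 0$. The
decision variables of the transportation problem are denoted by $\delta_{ij} \ge 0$, for each
$(i, j) \in \bar E$, which represents the amount of product shipped from node $i$ to node
$j$ along arc $(i, j)$. To find a shipping plan so that the demands are satisfied by the
supplies, we want to find $\delta_{ij}$'s that satisfy the following flow-balance constraints:

$$L_\lambda - \hat\theta_i + \sum_{j \in \bar n^+(i)} \delta_{ij} - \sum_{j \in \bar n^-(i)} \delta_{ji} = 0, \quad \text{for } i \in \bar V,\ \hat\theta_i < L_\lambda, \qquad (68)$$

$$L_\lambda + \lambda - \hat\theta_i + \sum_{j \in \bar n^+(i)} \delta_{ij} - \sum_{j \in \bar n^-(i)} \delta_{ji} = 0, \quad \text{for } i \in \bar V,\ \hat\theta_i > L_\lambda + \lambda. \qquad (69)$$

The constraint (68) means that, for a demand node, the total amount of in-flow minus
the total amount of out-flow must equal its demand. The constraint (69) means that,
for a supply node, the total amount of out-flow minus the total amount of in-flow
must equal its supply.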

Then, we show that there exist $\delta_{ij} \ge 0$ such that (68) and (69) hold by the
following three observations.

• First, we note that the total demand is

$$\sum_{i \in \bar V,\ \hat\theta_i < L_\lambda} (L_\lambda - \hat\theta_i) = \sum_{s=1}^{r} k_s (L_\lambda - \bar\theta_s)_+$$

and the total supply is

$$\sum_{i \in \bar V,\ \hat\theta_i > L_\lambda + \lambda} (\hat\theta_i - L_\lambda - \lambda) = -\sum_{s=1}^{r} k_s (L_\lambda + \lambda - \bar\theta_s)_-.$$

Since $L_\lambda$ satisfies (61), the total demand above equals the total supply.

• Second, given any $i \in \bar V$ with $\hat\theta_i > L_\lambda + \lambda$, let $j$ be a successor of $i$ in the
partial order. Then, $j$ must be in $\bar V$ also because $\hat\theta_j \ge \hat\theta_i > L_\lambda + \lambda$. As a
result, $\max(V) \cap \bar V \ne \emptyset$ and there must exist a directed path in $\bar G$ from each
node $i \in \bar V$ with $\hat\theta_i > L_\lambda + \lambda$ to a maximal node in $\max(V) \cap \bar V$. Similarly,
we can show that $\min(V) \cap \bar V \ne \emptyset$ and there must exist a directed path in
$\bar G$ from a minimal node in $\min(V) \cap \bar V$ to each node $i \in \bar V$ with $\hat\theta_i < L_\lambda$.

• Third, by definition, $\bar G$ contains every arc from a node in $\max(V) \cap \bar V$ to a
node in $\min(V) \cap \bar V$.

By the three observations above, there always exists a shipping plan that
exactly matches supplies to demands in all nodes in $\bar G$. Therefore, there exist
$\delta_{ij} \ge 0$ satisfying (68) and (69).

Next we construct dual variables $u_{\lambda,ij}$ for $e = (i, j) \in \tilde E$ that satisfy the
conditions (64), (65), and (66) together with $\hat\theta_\lambda$ defined by (67) as follows:

$$u_{\lambda,ij} = u_{ij}, \quad \text{for } (i, j) \in E \setminus \bar E, \qquad (70)$$

$$u_{\lambda,ij} = 0, \quad \text{for } i \in \max(V),\ j \in \min(V),\ (i, j) \notin \bar E, \qquad (71)$$

$$u_{\lambda,ij} = u_{ij} + \delta_{ij}, \quad \text{for } (i, j) \in E \cap \bar E, \qquad (72)$$

$$u_{\lambda,ij} = \delta_{ij}, \quad \text{for } i \in \max(V),\ j \in \min(V),\ (i, j) \in \bar E. \qquad (73)$$

We can easily see that all $u_{\lambda,ij}$'s defined as above are non-negative.

First, we show that (64) holds. For $i \in V \setminus \bar V$, we have $\hat\theta_{\lambda,i} = \hat\theta_i$ according to
(67), which further implies (64) together with (70), (71) and (62). For $i \in \bar V$ with
$\hat\theta_i < L_\lambda$, we have $\hat\theta_{\lambda,i} = L_\lambda$ and summing (62) and (68) yields (64). For $i \in \bar V$
with $\hat\theta_i > L_\lambda + \lambda$, we have $\hat\theta_{\lambda,i} = L_\lambda + \lambda$ and summing (62) and (69) yields (64).

Second, we show that (65) holds. It suffices to prove that $u_{\lambda,ij} = 0$ for $(i, j) \in E$
such that $\hat\theta_{\lambda,i} < \hat\theta_{\lambda,j}$, which can only happen when $(i, j) \in E \setminus \bar E$ (note that when
$(i, j) \in E \cap \bar E$, we must have either $\hat\theta_{\lambda,i} = \hat\theta_{\lambda,j} = L_\lambda$ or $\hat\theta_{\lambda,i} = \hat\theta_{\lambda,j} = L_\lambda + \lambda$). In
this case, we have $\hat\theta_{\lambda,i} = \hat\theta_i < \hat\theta_j = \hat\theta_{\lambda,j}$. By (70) and (63), (65) holds.

Third, we show that (66) holds. It suffices to prove that $u_{\lambda,ij} = 0$ for $i \in \max(V)$
and $j \in \min(V)$ such that $\hat\theta_{\lambda,i} < \hat\theta_{\lambda,j} + \lambda$, which can only happen when $i \in \max(V)$,
$j \in \min(V)$ and $(i, j) \notin \bar E$. In this case, (66) is implied by (71).

Then, all the KKT conditions are satisfied by $\hat\theta_\lambda$ given in (67) and the dual
variables defined in (70), (71), (72) and (73). Hence, such a $\hat\theta_\lambda$ is an optimal
solution for bounded isotonic regression. □

Proof of Theorem 3.4. According to Proposition 3.3, when $\bar\theta_r - \bar\theta_1 > \lambda \ge 0$, we
have

$$\hat\theta_{\lambda,i} = \max(L_\lambda, \min(L_\lambda + \lambda, \bar\theta_s)), \quad \text{for } i \in U_s,$$

where $L_\lambda$ is non-increasing in $\lambda$. Therefore, the number of connected components
is non-decreasing in $\lambda$; so is the divergence of $\hat\theta_\lambda(y)$. For $\lambda \ge \bar\theta_r - \bar\theta_1$, we have
$\hat\theta_\lambda(y) = \hat\theta(y)$, i.e., the solutions of the unbounded isotonic regression and the
bounded isotonic regression are identical. Therefore, the number of connected
components and the divergence of $\hat\theta_\lambda(y)$ are constant when $\lambda \ge \bar\theta_r - \bar\theta_1$. Combining
the above two cases on $\lambda$ completes the proof of the theorem. □

8.3. Proof of Results from Section 4 

Proof of Lemma 4.5. Suppose $-d = A^\top \lambda$ for a $\lambda \ge 0$. For any $(\xi, \theta)$ satisfying
$A\xi + B\theta \le c$, the objective value of (33) is bounded from below as

$$\frac{1}{2}\|\theta - y\|_2^2 + d^\top \xi \ge \frac{1}{2}\|\theta - y\|_2^2 - \lambda^\top (c - B\theta).$$

As a convex function of $\theta$, $\frac{1}{2}\|\theta - y\|_2^2 - \lambda^\top (c - B\theta)$ is always bounded from below
for any $\theta$. So is $\frac{1}{2}\|\theta - y\|_2^2 + d^\top \xi$.

Suppose $-d \ne A^\top \lambda$ for any $\lambda \ge 0$. According to Farkas's lemma (see e.g., Rockafellar
(1970, Corollary 22.3.1)), there exists $h \in \mathbb{R}^p$ such that $Ah \ge 0$ and
$-d^\top h < 0$. Given any feasible solution $(\xi, \theta)$ for (33), $(\xi - th, \theta)$ will also be a
feasible solution for any $t \ge 0$, whose objective value is

$$\frac{1}{2}\|\theta - y\|_2^2 + d^\top (\xi - th) = \frac{1}{2}\|\theta - y\|_2^2 + d^\top \xi - t\, d^\top h,$$

which approaches $-\infty$ as $t$ increases to infinity. Therefore, (33) will not have a
bounded optimal value. □

We present the following two lemmas which will be used in the proofs of Lemma
4.7, Theorem 4.6, Lemma 5.2 and Theorem 5.1.

Lemma 8.1. Suppose that $Q$ is a convex polyhedron in $\mathbb{R}^{p+n}$ defined as (24)
and $(\xi, \theta) \in Q$. Let

$$J := \{1 \le i \le m : \langle a_i, \xi \rangle + \langle b_i, \theta \rangle = c_i\}.$$

Then, $(\xi, \theta) \in \mathrm{relint}(F)$, where

$$F = \{(\xi', \theta') \in \mathbb{R}^{p+n} : A_J \xi' + B_J \theta' = c_J,\ A_{J^c} \xi' + B_{J^c} \theta' \le c_{J^c}\}.$$

Proof. Let $J^c$ be the complement set of $J$, namely, $J^c := \{1, 2, \ldots, m\} \setminus J$. By the
definition of $J$, we have $A_{J^c}\xi + B_{J^c}\theta < c_{J^c}$ so that there exists a small enough $\epsilon > 0$
such that $A_{J^c}\xi' + B_{J^c}\theta' < c_{J^c}$ for any $(\xi', \theta') \in B_\epsilon(\xi, \theta)$. According to Lemma 2.1,

$$\mathrm{aff}(F) = \{(\xi', \theta') \in \mathbb{R}^{p+n} : A_J \xi' + B_J \theta' = c_J\}$$

so that $B_\epsilon(\xi, \theta) \cap \mathrm{aff}(F) \subseteq F$. Hence, by definition, $(\xi, \theta) \in \mathrm{relint}(F)$. □

Lemma 8.2. Suppose that $Q$ is a convex polyhedron in $\mathbb{R}^{p+n}$ defined as (24).
Let

$$(\hat\theta(y), \hat\xi(y)) \in \operatorname*{argmin}\ \frac{1}{2}\|\theta - y\|_2^2 + f(\xi) \quad \text{s.t. } A\xi + B\theta \le c, \qquad (74)$$

where $f(\cdot)$ is a convex differentiable function. Then, the optimal solution $\hat\theta(y)$ is
unique for each $y$. The components of $\hat\theta(y)$ are almost differentiable, and $\nabla\hat\theta_i$ is
an essentially bounded function, for each $i = 1, \ldots, n$.

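Proof. The uniqueness of $\hat\theta(y)$ can be easily shown via a strong convexity argument.
Assume that there are two optimal solutions to (74), $(\hat\theta_1(y), \hat\xi_1(y))$
and $(\hat\theta_2(y), \hat\xi_2(y))$, with $\hat\theta_1(y) \ne \hat\theta_2(y)$. Then, the solution $((\hat\theta_1(y) + \hat\theta_2(y))/2, (\hat\xi_1(y) + \hat\xi_2(y))/2)$ is
a feasible solution with strictly smaller objective value, i.e.,

$$\frac{1}{2}\left\|\frac{\hat\theta_1(y) + \hat\theta_2(y)}{2} - y\right\|_2^2 + f\left(\frac{\hat\xi_1(y) + \hat\xi_2(y)}{2}\right)
< \frac{1}{4}\|\hat\theta_1(y) - y\|_2^2 + \frac{1}{2} f(\hat\xi_1(y)) + \frac{1}{4}\|\hat\theta_2(y) - y\|_2^2 + \frac{1}{2} f(\hat\xi_2(y)),$$

which contradicts the optimality of $(\hat\theta_1(y), \hat\xi_1(y))$ and $(\hat\theta_2(y), \hat\xi_2(y))$.

The almost differentiability of $\hat\theta(y)$ and the essential boundedness of $\nabla\hat\theta_i$ can be
proved by a scheme similar to the proof of Proposition 1 in Meyer and Woodroofe
(2000). In particular, it suffices to prove that $\hat\theta(y)$ is Lipschitz continuous, namely,
$\|\hat\theta(y_1) - \hat\theta(y_2)\|_2 \le \|y_1 - y_2\|_2$, which further implies the almost differentiability
of $\hat\theta(y)$ by Rademacher's theorem (Federer (1969)). According to the optimality
condition of (74), we have

$$\langle y_1 - \hat\theta(y_1), \hat\theta(y_2) - \hat\theta(y_1) \rangle - \langle \nabla f(\hat\xi(y_1)), \hat\xi(y_2) - \hat\xi(y_1) \rangle \le 0,$$
$$\langle y_2 - \hat\theta(y_2), \hat\theta(y_1) - \hat\theta(y_2) \rangle - \langle \nabla f(\hat\xi(y_2)), \hat\xi(y_1) - \hat\xi(y_2) \rangle \le 0.$$

Adding these two inequalities leads to

$$\langle y_1 - y_2 - (\hat\theta(y_1) - \hat\theta(y_2)), \hat\theta(y_2) - \hat\theta(y_1) \rangle
+ \langle \nabla f(\hat\xi(y_2)) - \nabla f(\hat\xi(y_1)), \hat\xi(y_2) - \hat\xi(y_1) \rangle \le 0.$$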








Chen, X., Lin, Q. and Sen, B./On Degrees of Freedom of Projection Estimators 42 


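Since $f(\cdot)$ is convex so that $\nabla f(\cdot)$ is monotone, we have

$$\langle \nabla f(\hat\xi(y_2)) - \nabla f(\hat\xi(y_1)), \hat\xi(y_2) - \hat\xi(y_1) \rangle \ge 0,$$

which implies

$$\|\hat\theta(y_1) - \hat\theta(y_2)\|_2^2 \le \langle y_2 - y_1, \hat\theta(y_2) - \hat\theta(y_1) \rangle \le \|y_2 - y_1\|_2 \|\hat\theta(y_2) - \hat\theta(y_1)\|_2,$$

and thus $\|\hat\theta(y_1) - \hat\theta(y_2)\|_2 \le \|y_1 - y_2\|_2$. □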

Proof of Lemma 4.7. Given any face $F$ of $Q$, $\mathrm{Proj}_\theta(F) + R_{-d}(N(F))$ is itself a
polyhedron in $\mathbb{R}^n$ so that its boundary $\mathrm{bd}(\mathrm{Proj}_\theta(F) + R_{-d}(N(F)))$ is a measure
zero set in $\mathbb{R}^n$. Since $Q$ has finitely many faces, the set

$$\bigcup_{F \text{ is a face of } Q} \mathrm{bd}\big(\mathrm{Proj}_\theta(F) + R_{-d}(N(F))\big) \qquad (75)$$

has zero measure in $\mathbb{R}^n$. Therefore, to prove Lemma 4.7, it suffices to show
that, for any $y$ not in (75), there is an associated neighborhood $U$ of $y$ such that
$\hat\theta(z) = \tilde\theta(z)$ for every $z \in U$.

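Suppose $y$ is not in (75). Let $(\hat\xi(y), \hat\theta(y))$ be defined as in (33) and $J_y$ be
defined as in (37). We consider a face of $Q$ defined as

$$F_y = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y},\ A_{J_y^c}\xi + B_{J_y^c}\theta \le c_{J_y^c}\},$$

where $J_y^c$ is the complement set of $J_y$. According to Lemma 8.1, we have $(\hat\xi(y), \hat\theta(y)) \in
\mathrm{relint}(F_y)$.

Next we want to show

$$F_y \subseteq \operatorname*{argmax}_{(\xi, \theta) \in Q}\ \langle -d, \xi \rangle + \langle y - \hat\theta(y), \theta \rangle.$$

According to the KKT conditions of the minimization problem (33) and the
definition of $J_y$, there exists a Lagrange multiplier $\hat\lambda \in \mathbb{R}^m$ such that

$$\hat\theta(y) - y + B_{J_y}^\top \hat\lambda_{J_y} = 0, \quad d + A_{J_y}^\top \hat\lambda_{J_y} = 0,$$
$$A_{J_y}\hat\xi(y) + B_{J_y}\hat\theta(y) = c_{J_y}, \quad A_{J_y^c}\hat\xi(y) + B_{J_y^c}\hat\theta(y) < c_{J_y^c},$$
$$\hat\lambda_{J_y} \ge 0, \quad \hat\lambda_{J_y^c} = 0,$$

where $\hat\lambda_{J_y}$ and $\hat\lambda_{J_y^c}$ are sub-vectors of $\hat\lambda$ indexed by $J_y$ and $J_y^c$, respectively.
Therefore, the KKT conditions of the maximization problem $\max_{(\xi, \theta) \in Q} \langle -d, \xi \rangle +
\langle y - \hat\theta(y), \theta \rangle$, namely,

$$\hat\theta(y) - y + B^\top \lambda = 0, \quad d + A^\top \lambda = 0,$$
$$A\xi + B\theta \le c, \quad \lambda \ge 0,$$
$$(\langle a_i, \xi \rangle + \langle b_i, \theta \rangle - c_i)\lambda_i = 0, \quad \forall\, i = 1, 2, \ldots, m,$$

hold for any $(\xi, \theta) \in F_y$ with $\lambda = \hat\lambda$, which implies $F_y \subseteq \operatorname*{argmax}_{(\xi, \theta) \in Q} \langle -d, \xi \rangle +
\langle y - \hat\theta(y), \theta \rangle$. By the definition of the normal cone, we have $(-d, y - \hat\theta(y)) \in N(F_y)$,
and thus, $y - \hat\theta(y) \in R_{-d}(N(F_y))$. Hence, we have $y = (y - \hat\theta(y)) + \hat\theta(y) \in
\mathrm{Proj}_\theta(F_y) + R_{-d}(N(F_y))$.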

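Because $y$ is not in (75), $\mathrm{Proj}_\theta(F_y) + R_{-d}(N(F_y))$ must have full dimension
and contain $y$ in its interior. Therefore, there exists a neighborhood $U$ of $y$
contained in $\mathrm{int}(\mathrm{Proj}_\theta(F_y) + R_{-d}(N(F_y)))$ such that, for any $z \in U$, there exist
$(\bar\xi(z), \bar\theta(z)) \in F_y$ with $z - \bar\theta(z) \in R_{-d}(N(F_y))$. This follows from the fact that for
any $z \in U$, as $z$ belongs to $\mathrm{int}(\mathrm{Proj}_\theta(F_y) + R_{-d}(N(F_y)))$, $z$ can be expressed as
$z = \bar\theta(z) + (z - \bar\theta(z))$ where $\bar\theta(z) \in \mathrm{Proj}_\theta(F_y)$ and $z - \bar\theta(z) \in R_{-d}(N(F_y))$. Now
from the definition of $\mathrm{Proj}_\theta(F_y)$, there exists $\bar\xi(z)$ such that $(\bar\xi(z), \bar\theta(z)) \in F_y$.

If there exist multiple qualified $\bar\xi(z)$, we choose the one that minimizes $\|\bar\xi(z) -
\hat\xi(y)\|_2$. Since $z - \bar\theta(z) \in R_{-d}(N(F_y))$, by the definition of $R_{-d}(N(F_y))$, we have
$(-d, z - \bar\theta(z)) \in N(F_y)$, which further implies

$$F_y \subseteq \operatorname*{argmax}_{(\xi, \theta) \in Q}\ \langle -d, \xi \rangle + \langle z - \bar\theta(z), \theta \rangle,$$

by the definition of $N(F_y)$. Since $(\bar\xi(z), \bar\theta(z)) \in F_y$, we have

$$(\bar\xi(z), \bar\theta(z)) \in \operatorname*{argmax}_{(\xi, \theta) \in Q}\ \langle -d, \xi \rangle + \langle z - \bar\theta(z), \theta \rangle,$$

which is equivalent to

$$\langle -d, \xi \rangle + \langle z - \bar\theta(z), \theta \rangle \le \langle -d, \bar\xi(z) \rangle + \langle z - \bar\theta(z), \bar\theta(z) \rangle,$$

for any $(\xi, \theta) \in Q$. This implies

$$\langle d, \xi - \bar\xi(z) \rangle + \langle \bar\theta(z) - z, \theta - \bar\theta(z) \rangle \ge 0,$$

for any $(\xi, \theta) \in Q$, which, by the optimality conditions (see e.g., Bertsekas, Nedic
and Ozdaglar (2003, Proposition 4.7.1)), shows that $(\bar\theta(z), \bar\xi(z))$ is an optimal
solution of

$$\min_{(\xi, \theta) \in Q}\ \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi.$$

Due to (33) and the uniqueness of the optimal solution of this minimization
problem in its $\theta$-component, we have $\hat\theta(z) = \bar\theta(z) \in \mathrm{Proj}_\theta(F_y)$ for any $z \in U$.
Without loss of generality, we can set $\hat\xi(z) = \bar\xi(z)$ as well. Since $(\hat\xi(y), \hat\theta(y)) \in
\mathrm{relint}(F_y)$, by the continuity of $\hat\xi(\cdot)$ and $\hat\theta(\cdot)$, we can guarantee $(\hat\xi(z), \hat\theta(z)) \in
\mathrm{relint}(F_y)$ for any $z \in U$, if $U$ is small enough.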

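Next, we show that, for all $z \in U$,

$$\operatorname*{argmin}_{(\xi, \theta) \in Q} \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi
= \operatorname*{argmin}_{(\xi, \theta) \in F_y} \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi
= \operatorname*{argmin}_{(\xi, \theta) \in \mathrm{aff}(F_y)} \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi. \qquad (76)$$

The first equality of the above display follows from the fact that $(\hat\xi(z), \hat\theta(z)) \in F_y$.
We prove the second equality by contradiction. Suppose that the equality does not
hold. Then, there must exist $(\xi', \theta') \in \mathrm{aff}(F_y) \setminus F_y$ such that $\frac{1}{2}\|\theta' - z\|_2^2 + d^\top \xi' <
\frac{1}{2}\|\hat\theta(z) - z\|_2^2 + d^\top \hat\xi(z)$, for some $z \in U$. Because $(\hat\xi(z), \hat\theta(z)) \in \mathrm{relint}(F_y)$, there
exists a small enough $\alpha > 0$ such that $\alpha(\theta', \xi') + (1 - \alpha)(\hat\theta(z), \hat\xi(z)) \in F_y$ and,
by convexity,

$$\begin{aligned}
&\frac{1}{2}\|\alpha\theta' + (1 - \alpha)\hat\theta(z) - z\|_2^2 + d^\top (\alpha\xi' + (1 - \alpha)\hat\xi(z)) \\
&\quad \le \alpha\left[\frac{1}{2}\|\theta' - z\|_2^2 + d^\top \xi'\right] + (1 - \alpha)\left[\frac{1}{2}\|\hat\theta(z) - z\|_2^2 + d^\top \hat\xi(z)\right] \\
&\quad < \frac{1}{2}\|\hat\theta(z) - z\|_2^2 + d^\top \hat\xi(z),
\end{aligned}$$

which leads to a contradiction to the optimality of $(\hat\xi(z), \hat\theta(z))$ in the first equality
in (76). Therefore, we have $(\hat\xi(z), \hat\theta(z)) \in \operatorname*{argmin}_{(\xi, \theta) \in \mathrm{aff}(F_y)} \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi$. Since
$\mathrm{aff}(F_y) = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}\}$ due to Lemma 2.1, Lemma 4.7
follows. □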

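Proof of Theorem 4.6. The uniqueness of $\hat\theta(y)$, the almost differentiability of
$\hat\theta(y)$, and the essential boundedness of $\nabla\hat\theta_i$ can be proved by Lemma 8.2.
Moreover, Lemma 4.7 implies that for a.e. $y \in \mathbb{R}^n$,

$$D(y) = \nabla_y \hat\theta(y) = \nabla_y \tilde\theta(y),$$

where $\tilde\theta(y)$ is defined in (39). By the definition of $I_y$, we have

$$\{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}\} = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{I_y}\xi + B_{I_y}\theta = c_{I_y}\}$$

so that $(\tilde\theta(y), \tilde\xi(y))$ in (39) can be equivalently defined as

$$(\tilde\theta(z), \tilde\xi(z)) = \operatorname*{argmin}\ \frac{1}{2}\|\theta - z\|_2^2 + d^\top \xi \quad \text{s.t. } A_{I_y}\xi + B_{I_y}\theta = c_{I_y}. \qquad (77)$$

According to the optimality conditions of (77), there exists a Lagrange multiplier
$\lambda(y) \in \mathbb{R}^{|I_y|}$ such that

$$\tilde\theta(y) - y + B_{I_y}^\top \lambda(y) = 0, \qquad (78)$$
$$d + A_{I_y}^\top \lambda(y) = 0, \qquad (79)$$
$$A_{I_y}\tilde\xi(y) + B_{I_y}\tilde\theta(y) = c_{I_y}. \qquad (80)$$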

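We define $K$ as a matrix whose columns form a basis for the linear
space $\ker(A_{I_y}^\top)$ in $\mathbb{R}^{|I_y|}$. Hence, $K$ is a matrix of order $|I_y| \times (|I_y| - \mathrm{rank}(A_{I_y}))$.
Since (77) has a bounded optimal value, according to Lemma 4.5, there exists
a $\bar\lambda \in \mathbb{R}^{|I_y|}$ such that $-d = A_{I_y}^\top \bar\lambda$. Note that (79) shows $-d = A_{I_y}^\top \lambda(y)$, which
implies that $\lambda(y) - \bar\lambda \in \ker(A_{I_y}^\top)$. Therefore, there exists $v(y) \in \mathbb{R}^{|I_y| - \mathrm{rank}(A_{I_y})}$
such that $\lambda(y) = \bar\lambda + K v(y)$. Then, using (78),

$$\tilde\theta(y) = y - B_{I_y}^\top (\bar\lambda + K v(y)). \qquad (81)$$

Using the definition of $K$, multiplying both sides of (80) by $K^\top$, and using the
previous display, we have

$$\begin{aligned}
K^\top c_{I_y} &= K^\top B_{I_y} \tilde\theta(y) \\
&= K^\top B_{I_y} \big(y - B_{I_y}^\top (\bar\lambda + K v(y))\big) \\
&= K^\top B_{I_y} y - K^\top B_{I_y} B_{I_y}^\top \bar\lambda - K^\top B_{I_y} B_{I_y}^\top K v(y).
\end{aligned} \qquad (82)$$

We claim that $K^\top B_{I_y} B_{I_y}^\top K$ is invertible. Suppose otherwise. Then there exists a
non-zero vector $u \in \mathbb{R}^{|I_y| - \mathrm{rank}(A_{I_y})}$ such that $K^\top B_{I_y} B_{I_y}^\top K u = 0$, which implies
$B_{I_y}^\top K u = 0$. By the definition of $K$, $A_{I_y}^\top K u = 0$ also. Note that $Ku$ must be
non-zero as the columns of $K$ are linearly independent. However, this means that
$u^\top K^\top [A_{I_y}, B_{I_y}] = 0$, contradicting the fact that $I_y$ is chosen so that the rows of
the matrix $[A_{I_y}, B_{I_y}]$ are independent. Therefore, $K^\top B_{I_y} B_{I_y}^\top K$ must be invertible
so that (82) implies

$$v(y) = (K^\top B_{I_y} B_{I_y}^\top K)^{-1} K^\top B_{I_y} y - (K^\top B_{I_y} B_{I_y}^\top K)^{-1} K^\top c_{I_y} - (K^\top B_{I_y} B_{I_y}^\top K)^{-1} K^\top B_{I_y} B_{I_y}^\top \bar\lambda.$$

Plugging $v(y)$ into (81), we have

$$\tilde\theta(y) = y - B_{I_y}^\top K (K^\top B_{I_y} B_{I_y}^\top K)^{-1} K^\top B_{I_y} y + c',$$

where $c'$ is a constant vector not depending on $y$. Therefore,

$$\begin{aligned}
D(y) = \nabla_y \hat\theta(y) = \nabla_y \tilde\theta(y)
&= \mathrm{trace}\big(I_n - B_{I_y}^\top K (K^\top B_{I_y} B_{I_y}^\top K)^{-1} K^\top B_{I_y}\big) \\
&= n - (|I_y| - \mathrm{rank}(A_{I_y})),
\end{aligned}$$

which completes the proof. □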
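
The final trace identity can be verified numerically on random matrices. In the sketch below, `A` and `B` are hypothetical stand-ins for $A_{I_y}$ and $B_{I_y}$ (with generic entries the rows of $[A, B]$ are independent), and a basis `K` of $\ker(A^\top)$ is read off from the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 5, 2, 6                  # m plays the role of |I_y|
A = rng.standard_normal((m, p))    # stand-in for A_{I_y}
B = rng.standard_normal((m, n))    # stand-in for B_{I_y}

rank_A = np.linalg.matrix_rank(A)
# Rows of Vt beyond the rank span the null space of A^T (A.T is p x m, Vt is m x m).
_, _, Vt = np.linalg.svd(A.T)
K = Vt[rank_A:].T                  # m x (m - rank_A), satisfies A^T K = 0

G = K.T @ B @ B.T @ K              # invertible under the independence assumption
M = B.T @ K @ np.linalg.solve(G, K.T @ B)
divergence = np.trace(np.eye(n) - M)
print(divergence, n - (m - rank_A))
```

The matrix `M` is an orthogonal projection onto the column space of $B^\top K$, so its trace equals the number of columns of `K`, which is what the last equality in the proof uses.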

Proof of Proposition 4.8. Note that an equivalent formulation of the LSE given
in (30) is a special case of (33) when $d = 0$. Since each feasible solution of (30)
must satisfy — 0 = 0, Jy, as dehned in (37), includes all the constraints of 
(30) and Aj^ = and Bj^ = [—Since Bj^ contains In, all 

the rows of [Ajy,Bj^] are linear independent and thus Jy = Jy with |/y| = n. 
According to Theorem 4.6, for a.e. $y$, we have

$$\mathrm{df}(X\hat\beta(y)) = \mathrm{df}(\hat\theta(y)) = n - |I_y| + E[\mathrm{rank}(A_{I_y})] = \mathrm{rank}(X).$$

□


Proof of Proposition 4.9. Letting $\xi = (\beta^\top, \gamma^\top)^\top$ and $\theta = X\beta$ in (34), the generalized
Lasso problem can be reformulated as a special case of (33) as shown in
(36). We partition $\{1, 2, \ldots, l\}$ into three sets of indexes as:

$$I_+ := \{i : d_i^\top \hat\beta(y) > 0\}, \quad I_- := \{i : d_i^\top \hat\beta(y) < 0\}, \quad I_0 := \{i : d_i^\top \hat\beta(y) = 0\}.$$

According to the constraints $D\beta - \gamma \le 0$ and $-D\beta - \gamma \le 0$ in (36), the
optimality of $\hat\gamma_i(y)$ will ensure $\hat\gamma_i(y) = \max(d_i^\top \hat\beta(y), -d_i^\top \hat\beta(y))$, which implies
that $d_i^\top \hat\beta(y) - \hat\gamma_i(y) = 0$ for $i \in I_+ \cup I_0$ and $-d_i^\top \hat\beta(y) - \hat\gamma_i(y) = 0$ for $i \in I_- \cup I_0$.


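We define $D_+$, $D_-$ and $D_0$ as the sub-matrices of $D$ consisting of the rows
of $D$ indexed by $I_+$, $I_-$ and $I_0$, respectively. By ordering $\xi = (\beta^\top, \gamma^\top)^\top =
(\beta^\top, \gamma_{I_+}^\top, \gamma_{I_-}^\top, \gamma_{I_0}^\top)^\top$, we can represent the matrices $A_{J_y}$ and $B_{J_y}$ as

$$A_{J_y} = \begin{pmatrix}
X & 0 & 0 & 0 \\
-X & 0 & 0 & 0 \\
D_+ & -I & 0 & 0 \\
-D_- & 0 & -I & 0 \\
D_0 & 0 & 0 & -I \\
-D_0 & 0 & 0 & -I
\end{pmatrix}
\quad \text{and} \quad
B_{J_y} = \begin{pmatrix} -I \\ I \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}.$$
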

Let Do be the sub-matrix of Do that contains the maximum number of linearly 
independent rows of Do- Suppose Do has I rows. We have 


Ar, = 


/ X 0 0 

D+ -I 0 
-D_ 0 -I 

Do 0 0 

V -Do 0 0 


0 

0 

0 

-I 


\ 


and Br = 


[-h 0 ] / 


/-n 

0 
0 
0 

V 0 / 
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Therefore, |/y| = n + |/+| + |/_ | + |/o| + rank(Do) and rank(y4/y) = |/+| + |/_| + 
|/o| + rank([X''', -D(]^]). Let be an. {d — 1) x d matrix whose rows form a basis 
of the linear space ker(Zi)o)- Then [(Dq)^, Dq] becomes a d x d invertible matrix. 
Hence, 


rank 


/ 

■ X ■ 

\ , ( 

■ X ■ 



1 = rank 1 

V 

. ^0 

. -^0 


= rank 



A'(£>5'T 


= rank(X(Zi)Q)^) + rank(l3oL^o") 
= rank(X(Zi)Q)''') + rank(Zi)o)- 


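According to Theorem 4.6, for a.e. $y$, we have

$$\begin{aligned}
\mathrm{df}(X\hat\beta(y)) &= \mathrm{df}(\hat\theta(y)) \\
&= n - |I_y| + \mathbb{E}[\mathrm{rank}(A_{I_y})] \\
&= \mathbb{E}[\mathrm{rank}(X (D_0^\perp)^\top)] \\
&= \mathbb{E}[\dim(X \ker(D_0))].
\end{aligned}$$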


□ 
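The quantity inside the expectation, $\dim(X\ker(D_0))$, can be evaluated for a single fitted coefficient vector. The sketch below is only an illustration under stated assumptions: the function names `null_space_basis` and `genlasso_df_estimate` are hypothetical, NumPy is assumed to be available, and the zero rows of $D\hat\beta$ are detected with a numerical tolerance that a real solver would need to tune.

```python
import numpy as np

def null_space_basis(M, tol=1e-10):
    """Return a matrix whose columns form an orthonormal basis of ker(M)."""
    M = np.atleast_2d(M)
    if M.shape[0] == 0:
        # No constraints: the kernel is the whole space.
        return np.eye(M.shape[1])
    _, s, vt = np.linalg.svd(M)          # vt is d x d (full_matrices=True)
    rank = int(np.sum(s > tol))
    return vt[rank:].T                   # right singular vectors spanning ker(M)

def genlasso_df_estimate(X, D, beta_hat, tol=1e-8):
    """Plug-in value of dim(X ker(D_0)) for one fitted beta_hat (hypothetical helper)."""
    zero_rows = np.abs(D @ beta_hat) <= tol   # indices i with d_i^T beta_hat = 0
    D0 = D[zero_rows]
    basis = null_space_basis(D0)
    if basis.shape[1] == 0:
        return 0
    return int(np.linalg.matrix_rank(X @ basis))

# Fused-Lasso style example: D takes first differences, X is the identity.
D = np.array([[-1.0, 1.0, 0.0, 0.0],
              [0.0, -1.0, 1.0, 0.0],
              [0.0, 0.0, -1.0, 1.0]])
X = np.eye(4)
beta_hat = np.array([1.0, 1.0, 2.0, 2.0])     # two constant pieces
df_hat = genlasso_df_estimate(X, D, beta_hat)
```

In this example the fitted vector has two constant pieces, and the plug-in value equals the number of pieces.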

Proof of Corollary 4.10. In the special case of (34) with $D = I_d$, the matrix $D_0$ in Corollary 4.9 consists of the rows of $I_d$ indexed by $J_0$, which is essentially a projection matrix from $\mathbb{R}^d$ to the coordinates indexed by $J_0$. Therefore, $\ker(D_0) = \{x \in \mathbb{R}^d : x_i = 0, \ \forall i \in J_0\}$, so that $\dim(X\ker(D_0)) = \mathrm{rank}(X_{J_0^c})$ and the conclusion follows. □
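For the Lasso, the plug-in quantity reduces to the rank of the columns of $X$ on the active set. A minimal sketch, assuming NumPy and a hypothetical helper name `lasso_df_estimate`, with a tolerance deciding which coefficients count as nonzero:

```python
import numpy as np

def lasso_df_estimate(X, beta_hat, tol=1e-8):
    """Rank of the columns of X whose fitted coefficient is nonzero (hypothetical helper)."""
    active = np.abs(beta_hat) > tol      # complement of the zero set J_0
    if not active.any():
        return 0
    return int(np.linalg.matrix_rank(X[:, active]))

# The first two columns are identical, so activating both contributes rank 1 only.
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
df_duplicate = lasso_df_estimate(X, np.array([0.5, 0.2, 0.0]))
df_distinct = lasso_df_estimate(X, np.array([0.5, 0.0, 0.3]))
```

The example uses duplicated columns to show why the rank, rather than the count of nonzero coefficients, is the relevant quantity when $X$ is not of full column rank.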


8.4. Proof of Results from Section 5

Proof of Lemma 5.2. Note that Lemma 4.7 cannot be reduced to Lemma 5.2, and vice versa, due to the difference between the linear perturbation term in (33) and the quadratic perturbation term $\frac{\lambda}{2}\|\xi\|_2^2$ in (45). Therefore, a technique different from that used in the proof of Lemma 4.7 is needed to prove Lemma 5.2.

As discussed in Section 5, we assume $\lambda = 1$ without loss of generality. Given any face $F$ of $Q$, $\mathcal{R}(F + N(F))$ is itself a polyhedron in $\mathbb{R}^n$, so its boundary $\mathrm{bd}(\mathcal{R}(F + N(F)))$ is a measure zero set in $\mathbb{R}^n$. Since $Q$ has finitely many faces, the set

$$\bigcup_{F \text{ is a face of } Q} \mathrm{bd}(\mathcal{R}(F + N(F))) \tag{83}$$

is a measure zero set in $\mathbb{R}^n$. Therefore, to prove Lemma 5.2, it suffices to prove that, for any $y \in \mathbb{R}^n$ not in the set (83), there is an associated neighborhood $U$ of $y$ such that $\hat\theta_\lambda(z) = \tilde\theta_\lambda(z)$ for every $z \in U$.

Note that (83) has a very different structure from the set (75) considered in the proof of Lemma 4.7.

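For $y$ not in the set (83), let $\hat\theta_\lambda(y)$ and $\hat\xi_\lambda(y)$ be defined as in (45) and $J_y$ be defined as in (46). We consider the face $F_y$ of $Q$ defined by

$$F_y = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}, \ A_{J_y^c}\xi + B_{J_y^c}\theta \le c_{J_y^c}\},$$

where $J_y^c$ denotes the complement of $J_y$. According to Lemma 8.1, we have $(\hat\xi_\lambda(y), \hat\theta_\lambda(y)) \in \mathrm{relint}(F_y)$.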

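When $\lambda = 1$, (45) represents a projection of $(0^\top, y^\top)^\top$ onto $Q$. By a proof similar to that of Lemma 2.4, based on the KKT conditions of (45), we can show $(\hat\xi_\lambda(y)^\top, \hat\theta_\lambda(y)^\top)^\top \in F_y$ and $(-\hat\xi_\lambda(y)^\top, y^\top - \hat\theta_\lambda(y)^\top)^\top \in N(F_y)$, which further implies $(0^\top, y^\top)^\top \in F_y + N(F_y)$, i.e., $y \in \mathcal{R}(F_y + N(F_y))$.

Because $y$ is not in (83), $\mathcal{R}(F_y + N(F_y))$ must have full dimension and contain $y$ in its interior. Therefore, there exists a neighborhood $U$ of $y$ such that, for every $z \in U$, $(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) \in F_y$ and $(-\hat\xi_\lambda(z), z - \hat\theta_\lambda(z)) \in N(F_y)$. We claim that $U$ can be further chosen such that, for every $z \in U$, $(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) \in \mathrm{relint}(F_y)$. If not, there exists a sequence $\{z_k\}_{k \ge 1} \subset \mathcal{R}(F_y + N(F_y))$ converging to $y$ with $(\hat\xi_\lambda(z_k), \hat\theta_\lambda(z_k)) \in \mathrm{relbd}(F_y)$ for all $k$. Because $(\hat\xi_\lambda(\cdot), \hat\theta_\lambda(\cdot))$ is a continuous mapping and $\mathrm{relbd}(F_y)$ is a closed set, we have $(\hat\xi_\lambda(y), \hat\theta_\lambda(y)) \in \mathrm{relbd}(F_y)$, contradicting the fact that $(\hat\xi_\lambda(y), \hat\theta_\lambda(y)) \in \mathrm{relint}(F_y)$. Thus, $(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) \in \mathrm{relint}(F_y)$ for all $z \in U$.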

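Next we show that, for all $z \in U$,

$$\begin{aligned}
(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) &= \arg\min_{(\xi, \theta) \in Q} \|\theta - z\|_2^2 + \|\xi\|_2^2 \\
&= \arg\min_{(\xi, \theta) \in F_y} \|\theta - z\|_2^2 + \|\xi\|_2^2 \\
&= \arg\min_{(\xi, \theta) \in \mathrm{aff}(F_y)} \|\theta - z\|_2^2 + \|\xi\|_2^2.
\end{aligned}$$

The second equality holds because $(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) \in F_y \subset Q$. Suppose that the third equality does not hold. Then there must exist $(\xi', \theta') \in \mathrm{aff}(F_y) \setminus F_y$ such that $\|\theta' - z\|_2^2 + \|\xi'\|_2^2 < \|\hat\theta_\lambda(z) - z\|_2^2 + \|\hat\xi_\lambda(z)\|_2^2$. However, since $(\hat\xi_\lambda(z), \hat\theta_\lambda(z))$ is a relative interior point of $F_y$, there exists a small enough $\alpha > 0$ such that $\alpha(\xi', \theta') + (1 - \alpha)(\hat\xi_\lambda(z), \hat\theta_\lambda(z)) \in F_y$ and, by convexity of the objective,

$$\|\alpha\theta' + (1 - \alpha)\hat\theta_\lambda(z) - z\|_2^2 + \|\alpha\xi' + (1 - \alpha)\hat\xi_\lambda(z)\|_2^2 < \|\hat\theta_\lambda(z) - z\|_2^2 + \|\hat\xi_\lambda(z)\|_2^2,$$

which leads to a contradiction. According to Lemma 2.1, $\mathrm{aff}(F_y) = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}\}$, which means that $(\hat\xi_\lambda(z), \hat\theta_\lambda(z))$ is an optimal solution of (49). As a result, $\hat\theta_\lambda(z) = \tilde\theta_\lambda(z)$ for each $z \in U$, by the uniqueness of the optimal solution of (49). □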

Proof of Theorem 5.1. The uniqueness of $\hat\theta(y)$, the almost differentiability of $\hat\theta(y)$, and the essential boundedness of $\nabla\hat\theta_i$ follow from Lemma 8.2.

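By the definition of $I_y$, we have

$$\{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{J_y}\xi + B_{J_y}\theta = c_{J_y}\} = \{(\xi, \theta) \in \mathbb{R}^{p+n} : A_{I_y}\xi + B_{I_y}\theta = c_{I_y}\},$$

so that $(\tilde\theta_\lambda(z), \tilde\xi_\lambda(z))$ in (49) can be equivalently defined as

$$(\tilde\theta_\lambda(z), \tilde\xi_\lambda(z)) = \arg\min_{\theta, \xi} \ \frac{1}{2}\|\theta - z\|_2^2 + \frac{\lambda}{2}\|\xi\|_2^2 \quad \text{s.t.} \quad A_{I_y}\xi + B_{I_y}\theta = c_{I_y}. \tag{84}$$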


For notational simplicity, we write the index set $I_y$ as $I$ and, with a slight abuse of notation, let $I_n$ denote the $n \times n$ identity matrix. By Lemma 5.2, $D(y) = \nabla_y \cdot \hat\theta_\lambda(y) = \nabla_y \cdot \tilde\theta_\lambda(y)$. To compute $\nabla_y \tilde\theta_\lambda(y)$, we introduce the dual variable $w \in \mathbb{R}^{|I|}$ for the equality constraint and write down the KKT conditions of the optimization problem in (84) at the point $y$:

$$\begin{aligned}
\theta - y + B_I^\top w &= 0, \\
\lambda\xi + A_I^\top w &= 0, \\
A_I\xi + B_I\theta - c_I &= 0.
\end{aligned}$$

We note that $w$ is unconstrained and that the complementary slackness condition is not relevant because the constraint is an equality. Unlike in the proof of Theorem 4.6, we do not first derive a closed form like (82) for $\hat\theta(y)$. Instead, we directly characterize $\nabla_y \tilde\theta_\lambda(y)$ by applying the implicit function theorem to these KKT conditions.

Given the system of equations in the KKT conditions, the corresponding Jacobian matrix with respect to $(\theta, \xi, w)$ at the optimal solution takes the following form:

$$J(\theta, \xi, w) = \begin{pmatrix} I_n & 0 & B_I^\top \\ 0 & \lambda I_p & A_I^\top \\ B_I & A_I & 0 \end{pmatrix}, \tag{85}$$

and the Jacobian matrix with respect to $y$ takes the following form:

$$J(y) = \begin{pmatrix} -I_n \\ 0 \\ 0 \end{pmatrix}.$$

Let $\gamma = [\theta, \xi, w] \in \mathbb{R}^{n + p + |I|}$. The implicit function theorem implies that, at the optimal primal and dual solutions,

$$\nabla_y \gamma = -J(\theta, \xi, w)^{-1} J(y),$$

which further implies that

$$\nabla_y \cdot \tilde\theta_\lambda(y) = -\mathrm{tr}\left([J(\theta, \xi, w)^{-1} J(y)](1:n, 1:n)\right), \tag{86}$$

where $[J(\theta, \xi, w)^{-1} J(y)](1:n, 1:n)$ denotes the top-left $n \times n$ sub-matrix of $J(\theta, \xi, w)^{-1} J(y)$.

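Due to the special structure of $J(\theta, \xi, w)$ in (85), its inverse can be computed analytically. In particular, let $D_I = B_I B_I^\top + \frac{1}{\lambda} A_I A_I^\top$. We note that $D_I$ is an invertible matrix since the matrix $[A_I, B_I]$ has full row rank. The inverse of $J(\theta, \xi, w)$ takes the following form:

$$J(\theta, \xi, w)^{-1} = \begin{pmatrix} I_n - B_I^\top D_I^{-1} B_I & -\frac{1}{\lambda} B_I^\top D_I^{-1} A_I & B_I^\top D_I^{-1} \\ -\frac{1}{\lambda} A_I^\top D_I^{-1} B_I & \frac{1}{\lambda} I_p - \frac{1}{\lambda^2} A_I^\top D_I^{-1} A_I & \frac{1}{\lambda} A_I^\top D_I^{-1} \\ D_I^{-1} B_I & \frac{1}{\lambda} D_I^{-1} A_I & -D_I^{-1} \end{pmatrix}.$$

By plugging the above formula for the inverse of $J(\theta, \xi, w)$ into (86), we obtain the divergence in (47), which completes the proof. □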
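The block-inversion step can be checked numerically. The sketch below is an illustration only, not the paper's computation: it assumes NumPy, uses hypothetical function names, draws random matrices $A$ and $B$ standing in for $A_I$ and $B_I$, and takes $D = BB^\top + \frac{1}{\lambda}AA^\top$, which is our reading of the partly illegible definition above. It compares the trace of the top-left $n \times n$ block of the inverse KKT matrix with the closed form $n - \mathrm{tr}(B^\top D^{-1} B)$ that block inversion gives.

```python
import numpy as np

def kkt_matrix(A, B, lam):
    """Assemble the KKT Jacobian with respect to (theta, xi, w)."""
    m, p = A.shape
    n = B.shape[1]
    return np.block([
        [np.eye(n), np.zeros((n, p)), B.T],
        [np.zeros((p, n)), lam * np.eye(p), A.T],
        [B, A, np.zeros((m, m))],
    ])

def divergence_via_kkt(A, B, lam):
    """Trace of d(theta)/dy, read off the top-left n x n block of the inverse."""
    n = B.shape[1]
    J_inv = np.linalg.inv(kkt_matrix(A, B, lam))
    return float(np.trace(J_inv[:n, :n]))

def divergence_closed_form(A, B, lam):
    """n - tr(B^T D^{-1} B) with D = B B^T + A A^T / lam (assumed definition)."""
    n = B.shape[1]
    D = B @ B.T + (A @ A.T) / lam
    return float(n - np.trace(B.T @ np.linalg.solve(D, B)))

rng = np.random.default_rng(0)
n, p, m, lam = 5, 3, 4, 0.7
A = rng.standard_normal((m, p))   # stands in for A_I
B = rng.standard_normal((m, n))   # stands in for B_I; [A, B] has full row rank a.s.
gap = abs(divergence_via_kkt(A, B, lam) - divergence_closed_form(A, B, lam))
```

Because the equality-constrained problem is a quadratic program, its solution is affine in $y$, so the top-left block of the inverse is exactly the Jacobian of the fitted values and no finite differencing is needed.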

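Proof of Corollary 5.4. By setting $\xi = \beta$ and $\theta = X\beta$, (53) can be reformulated as a special case of (45), i.e.,

$$\begin{aligned}
(\hat\theta_\lambda(y), \hat\xi_\lambda(y)) = \arg\min_{\theta, \xi} \ & \frac{1}{2}\|\theta - y\|_2^2 + \frac{\lambda}{2}\|\xi\|_2^2 \\
\text{s.t.} \ & X\xi - \theta \le 0, \\
& -X\xi + \theta \le 0.
\end{aligned} \tag{87}$$

Since each feasible solution of (87) must satisfy $X\xi - \theta = 0$, $J_y$ includes all the constraints of (87) and thus $A_{J_y} = [X^\top, -X^\top]^\top$ and $B_{J_y} = [-I_n, I_n]^\top$. It is easy to see that $A_{I_y} = X$ and $B_{I_y} = -I_n$. According to Theorem 5.1, for a.e. $y \in \mathbb{R}^n$, we have

$$\begin{aligned}
\mathrm{df}(X\hat\beta_\lambda(y)) &= \mathrm{df}(\hat\theta_\lambda(y)) \\
&= n - \mathrm{trace}\left(\left(I_n + \frac{1}{\lambda} X X^\top\right)^{-1}\right) \\
&= n - \mathrm{trace}(I_n) + \mathrm{trace}\left(X(\lambda I_d + X^\top X)^{-1} X^\top\right) \\
&= \mathrm{trace}\left(X(\lambda I_d + X^\top X)^{-1} X^\top\right),
\end{aligned}$$

where the third equality is due to the Sherman-Morrison-Woodbury formula. □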
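The two trace expressions for the ridge degrees of freedom can be compared numerically. The following is a minimal sketch, assuming NumPy; the function names are hypothetical, and the singular-value form $\sum_i s_i^2 / (s_i^2 + \lambda)$ is the standard equivalent expression, added here only as a third cross-check.

```python
import numpy as np

def ridge_df_primal(X, lam):
    """trace(X (lam I_d + X^T X)^{-1} X^T)."""
    d = X.shape[1]
    return float(np.trace(X @ np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T)))

def ridge_df_dual(X, lam):
    """n - trace((I_n + X X^T / lam)^{-1})."""
    n = X.shape[0]
    return float(n - np.trace(np.linalg.inv(np.eye(n) + X @ X.T / lam)))

def ridge_df_svd(X, lam):
    """Sum of s_i^2 / (s_i^2 + lam) over the singular values s_i of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
lam = 2.5
values = (ridge_df_primal(X, lam), ridge_df_dual(X, lam), ridge_df_svd(X, lam))
```

All three values agree up to floating-point error, and each lies strictly between 0 and $\mathrm{rank}(X)$ for $\lambda > 0$.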


